Parallel computer comprised of processor elements having a local memory and an enhanced data
transfer mechanism.

In a parallel computer, there are provided a plurality of processor
elements (1-1 to 1-n) connected to each other by a network (2); each of
said processor elements including a local memory (6) for holding a
program and data related thereto, a processor (3) for performing an
instruction in said program, a circuit (5) for transferring the data to
the other processor elements, and a circuit (4) for receiving the data
sent from the other processor elements; a memory area constructed of a
plurality of reception data areas for temporarily storing data received
by said receiving circuit (4), and a memory constructed of a plurality
of tag areas, provided for each of the reception data areas, for storing
a valid data tag or an invalid data tag indicating that the data in the
corresponding reception data area is valid or invalid; a transmitting
circuit (5) for transmitting the data to be transmitted with a data
identifier predetermined for said data attached thereto; a receiving
circuit for writing the data into one of said plurality of reception
data areas in response to the data received from said network, and
writing the valid data tag into one of said plurality of reception data
areas, said receiving circuit being operated in parallel with said
processor; and an access circuit for reading both the data and tag from
one of the reception data areas determined by said data identifier and
from the corresponding tag areas in response to the data identifier
designated by the instruction which is produced from said program for
requiring the data reception, and for repeatedly reading the tag and
data from the tag area and reception data area until the valid data tag
is read out from the tag area in case that the read tag corresponds to
the invalid data tag.
PARALLEL COMPUTER COMPRISED OF PROCESSOR ELEMENTS HAVING A LOCAL MEMORY AND AN
ENHANCED DATA TRANSFER MECHANISM



Field of the Invention 

The present invention relates to a parallel com- 
puter constructed of a plurality of processor ele- 
ments. 



Description of the Related Art 

As one prior art parallel computer constructed of a plurality of
processor elements, a first type of parallel computer is known in which
each of the processor elements includes a local memory for storing a
program executed therein and also data, and the respective processor
elements can access the local memories of other processor elements, if
required. In this sort of parallel computer, when one processor element
transfers data to the local memory of another processor element, the
data transferring processor element first writes the data into the local
memory of the data receiving processor element and then interrupts the
data receiving processor element in order to assure the reference order
of the data. Such a prior art parallel computer is described in, for
instance, "IEEE, PROCEEDINGS OF THE 1985 INTERNATIONAL CONFERENCE ON
PARALLEL PROCESSING", pages 782 to 788.

On the other hand, as a second type of conventional parallel computer, a
parallel computer is known which is constructed of a plurality of
processor elements coupled with a common memory. In such a conventional
parallel computer, a tag representing whether the data is valid (i.e.,
has been written) or invalid (i.e., has not yet been written) is
attached to each word of the common memory. The data communication is
performed between the relevant processor elements by utilizing this tag.
In other words, when data is transferred from one processor element to
another processor element, the processor element of the data transfer
side writes the data into the common memory and then changes the tag for
this written data into the condition "the data has been written".

The processor element of the data receiving side checks whether or not
the tag provided for the relevant memory position indicates that "the
data is valid" in order to judge whether or not the data to be read is
present in the common memory. If the checked tag indicates that "the
data is valid", this data is read out and thereafter the above-described
tag is changed at the proper timing into "the invalid data condition".
As a result, the processor element of the data transfer side can send
the data to the processor element of the data reception side without
interrupting the latter-mentioned processor element. Such a sort of
parallel computer is described in, for instance, "REAL-TIME SIGNAL
PROCESSING IV, VOL 298", August 1981, pages 241 to 248.

In the above-described first type of conventional parallel computer,
since the data calculation is carried out by the respective processor
elements utilizing their local memories, there is very little
restriction on the arrangement of the parallel computer when the number
of the processor elements is increased. It is therefore relatively easy
to increase the number of the processor elements constructing such a
parallel computer. However, when the data is transferred from one
processor element to another processor element, the above-described
interruption process operation is required at the data receiving
processor element. Such an overhead operation may considerably lower the
overall performance of the parallel computer.

The second type of conventional parallel computer, on the other hand,
has the following drawbacks. Although the above-described overhead
operation such as the interruption process operation is not required,
the common memory is accessed by all of the processor elements, so that
there is a great delay in the access time because the memory accessing
operations of all of the processor elements compete with each other. Due
to this access delay, it is difficult to employ a large number of
processor elements in such a parallel computer. As a consequence, a
high-speed parallel computer of the second type can hardly be realized.

As previously described, there are problems in the above-described first
and second types of conventional parallel computers. To solve these
conventional problems, a third type of parallel computer has been
proposed by the Applicants, which is disclosed in the copending Japanese
patent applications Nos. 61-182361 (filed on August 1, 1986) and
62-56507 (filed on March 13, 1987), or the corresponding US patent
application serial No. 78656 (filed on July 28, 1987), or the
corresponding EPC patent application No. 87111124.1 (filed on July 31,
1987).

In the third type of parallel computer, no common memory causing the
drawback of the second type of parallel computer is employed, and the
same local memories as those of the first type of parallel computer are
employed for the respective processor elements. However, data reception
buffers different from the local memories are additionally required for
the respective processor elements in order to avoid the interruption
process operation during the data communication.

In this third type of parallel computer, when the data is transferred,
the processor element of the data transfer side simultaneously transfers
both the data and an identifier for identifying the data to the
processor element of the data reception side. The transferred data is
stored in the reception buffer of the processor element of the data
reception side. In the processor element of the data reception side,
when this data is required to be accessed, the reception buffer is
associatively retrieved based upon the identifier which has been
previously determined for the data in question, and the data in question
is read out therefrom if it is present therein. As a result, the
above-described interruption process operation, which is the major
drawback of the first type of conventional parallel computer, is no
longer performed.
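
The identifier-based retrieval described above can be pictured with a short Python sketch. It is only an illustrative model and not the circuit of the cited applications: a dictionary stands in for the associative memory, the class name `ReceptionBuffer` and its methods are hypothetical, and removing an entry once it has been read is an assumption made for the example.

```python
class ReceptionBuffer:
    """Toy model of a reception buffer searched by data identifier."""

    def __init__(self):
        # identifier -> data; the dict stands in for the associative memory
        self._entries = {}

    def store(self, identifier, data):
        # Done on the receiving side when an (identifier, data) pair arrives.
        self._entries[identifier] = data

    def retrieve(self, identifier):
        # Associative lookup: return the data if present, otherwise None.
        # Assumption for this sketch: an entry is removed once it is read.
        return self._entries.pop(identifier, None)


buf = ReceptionBuffer()
buf.store("A", 17)
first = buf.retrieve("A")    # the data is present
second = buf.retrieve("A")   # already consumed
missing = buf.retrieve("B")  # never sent
```

No interrupt is involved in this model: the receiving side simply asks for the identifier it needs when it needs it.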

To further improve the performance of the third type of parallel
computer, the Applicants have filed another Japanese patent application
No. 62-12359 on January 23, 1987 and the corresponding US patent
application serial No. 145614 on January 19, 1988 relating to a fourth
type of parallel computer. This fourth type of parallel computer
addresses the following drawback. In the above-described third type of
parallel computer, the reception processor must be programmed so that,
when plural pieces of data are required, each piece of data is read out
from the reception associative memory in a sequence predetermined for
the respective data. In general, however, since the plural pieces of
data are transferred from a plurality of transmission processors, the
order in which these pieces of data are received at the reception
associative memory is not determined in advance. Assume, for instance,
that four pieces of data, i.e., A, B, C and D, are received as a result
of the calculation by other processors, and that the reception processor
retrieves the maximum value among these four pieces of reception data.
If the maximum value retrieval program of the reception processor is so
designed that the data A, B, C and D are successively read from the
reception associative memory in this order, and a comparison is made
between the newly read data and the previously read data so as to find
the maximum value, the reception processor cannot proceed with the
maximum value retrieval program until the data A has been received, even
if the data B, C or D has been received by the reception buffer prior to
the reception of the data A. The fourth type of parallel computer is
intended to solve this drawback.

In accordance with the above-described fourth type of parallel computer,
as the identifiers for one group of data to be processed as a whole,
both a main identifier commonly used for the data group and
sub-identifiers specific to the respective data are determined. When the
data is transferred from one processor element to another processor
element, these identifiers are attached to the data, and in the
reception processor element, the main identifier is designated and data
having this designated main identifier is read out from the reception
associative memory. As a consequence, as soon as any one of the data
belonging to the same data group has been received, the maximum value
retrieval process can proceed on that data without waiting for the
reception of the other data.
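
The benefit of designating only the main identifier can be sketched as follows. This is a hedged illustration in Python, not the mechanism of the cited application: `GroupedReceptionBuffer`, `retrieve_any` and `running_max` are hypothetical names, and a per-group queue is assumed as a stand-in for the associative search on the main identifier.

```python
from collections import defaultdict, deque


class GroupedReceptionBuffer:
    """Toy buffer keyed by main identifier; arrivals are kept in order."""

    def __init__(self):
        self._groups = defaultdict(deque)  # main_id -> deque of (sub_id, data)

    def store(self, main_id, sub_id, data):
        self._groups[main_id].append((sub_id, data))

    def retrieve_any(self, main_id):
        """Return the oldest (sub_id, data) of the group, or None if empty."""
        queue = self._groups[main_id]
        return queue.popleft() if queue else None


def running_max(buf, main_id, expected_count):
    """Fold max over whatever has arrived; return (maximum, items consumed)."""
    best = None
    consumed = 0
    while consumed < expected_count:
        item = buf.retrieve_any(main_id)
        if item is None:
            break  # a real processor would wait here; the sketch just stops
        _, value = item
        best = value if best is None else max(best, value)
        consumed += 1
    return best, consumed


buf = GroupedReceptionBuffer()
# B, D and C arrive before A; the reader does not have to wait for A first.
buf.store("G", "B", 4)
buf.store("G", "D", 9)
buf.store("G", "C", 7)
partial = running_max(buf, "G", expected_count=4)
```

Because taking a maximum does not depend on the order of its operands, the three items that have already arrived can be folded in while A is still in flight.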

In the above-described third type of parallel computer, since an
associative memory is employed as the reception buffer and the
associative memory has a special construction, it is generally difficult
to obtain a large memory capacity for this associative memory, and the
total cost of such an associative memory becomes high. Consequently,
there is a problem in cost if a reception buffer having a large memory
capacity is constructed of the associative memory. In addition, when the
number of the processor elements is increased, the cost required for
employing the associative memory increases accordingly.



Summary of the Invention 

It is therefore an object of the present invention to provide a parallel
computer having a simpler construction, where a local memory is provided
in each of the processor elements and no interruption process operation
is required during the data transfer between the processors.

Another object of the present invention is to provide a parallel
computer having a simpler arrangement, where one group of data can be
processed in accordance with the data reception order, not the order
allocated to these data.

To achieve the above-described first object of the present invention, a
parallel computer according to the present invention is characterized by
comprising: a processor element which includes a memory area constructed
of a plurality of reception data areas for temporarily storing data
received by said receiving means, and a memory area (92,8) constructed
of a plurality of tag areas, provided for each of the reception data
areas, for storing a valid data tag or an invalid data tag indicating
that the data in the corresponding reception data area is valid or
invalid; a transmitting means (5) for transmitting the data to be
transmitted with a data identifier predetermined for said data attached
thereto; a receiving means for writing the data into one of said
plurality of reception data areas in response to the data received from
said network, and writing the valid data tag into one of said plurality
of reception data areas, said receiving means being operated in parallel
with said processor; and an access means (38) for reading both the data
and tag from one of the reception data areas determined by said data
identifier and from the corresponding tag areas in response to the data
identifier designated by the instruction which is produced from said
program for requiring the data reception, and for repeatedly reading the
tag and data from the tag area and reception data area until the valid
data tag is read out from the tag area in case that the read tag
corresponds to the invalid data tag.

More specifically, to achieve the above-described object of the
invention, a tag is provided for each word of the local memory, the tag
representing whether the data held in the word is valid or invalid, and
there are provided: validating means for setting the content of the tag
attached to a word so that it indicates "valid data" when data is
written into the word from an arbitrary processor element; tag accessing
means for continuing an inspection of the tag until its content
indicates "valid data"; and invalidating means for setting the content
of the tag attached to the word so that it indicates "invalid data" when
the word is read out by the processor element holding the word.
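
The interplay of the three means can be sketched in Python as follows. This is a minimal software model under stated assumptions, not the circuit of the embodiments: the names `TaggedWord` and `receive`, the lock, the polling interval and the use of `threading.Timer` as a stand-in for a remote sender are all chosen for illustration only.

```python
import threading
import time


class TaggedWord:
    """One memory word with a 1-bit tag (True = valid data, False = invalid)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = None
        self._valid = False

    def write(self, data):
        # Validating means: storing data also sets the tag to "valid data".
        with self._lock:
            self._data = data
            self._valid = True

    def read_tag_and_data(self):
        with self._lock:
            return self._valid, self._data

    def invalidate(self):
        # Invalidating means: mark the word as holding "invalid data".
        with self._lock:
            self._valid = False


def receive(word, poll_interval=0.001):
    # Tag accessing means: repeat the read until the tag says "valid data".
    while True:
        valid, data = word.read_tag_and_data()
        if valid:
            word.invalidate()
            return data
        time.sleep(poll_interval)


word = TaggedWord()
# Hypothetical remote sender: writes the word a little later, in parallel.
sender = threading.Timer(0.05, word.write, args=(42,))
sender.start()
value = receive(word)  # polls until the write has happened
sender.join()
tag_after_read, _ = word.read_tag_and_data()
```

The reader is never interrupted: it only observes the tag, and the writer never needs to know whether the reader is already waiting.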

In addition, the above-described objects of the present invention can be
achieved by another preferred embodiment. The above objects may be
achieved by a parallel computer having a plurality of independently
operable processor elements each having a local memory, where a
reception memory which is writable from an arbitrary processor element
is provided in each of the processor elements; a tag memory area is
provided for each of the data memory areas of the reception memory; and
validating means, tag accessing means, and invalidating means are
employed with respect to the reception memory.

The above-described objects of the invention may also be achieved by a
parallel computer where the data to be sent to another processor is
transferred together with an address of a local memory in the
destination processor, the address being produced from a main identifier
of the data group to which this data belongs and a sub-identifier for
distinguishing the data to be sent from other data in the data group;
the data is stored at the designated address of the local memory and
simultaneously the corresponding tag is brought into a valid condition
in the reception processor; when the reception data is read out from the
local memory, the address of the reception data group is generated based
upon the main identifier for the data retrieval purpose, and then the
desired reception data is read from the local memory based upon this
address; furthermore, the sub-identifier corresponding to the reception
data is generated from the read address, and finally both the reception
data and the sub-identifier are fetched.
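
One way to picture this address generation is the Python sketch below. The mapping itself is an assumption made purely for illustration (a fixed-size block of words per data group, with the sub-identifier as the offset inside the block); the text above does not specify the actual conversion, and `GROUP_SIZE`, `to_address`, `to_sub_id`, `store` and `fetch_from_group` are hypothetical names.

```python
GROUP_SIZE = 4  # assumed number of words reserved per data group


def to_address(main_id, sub_id):
    # Sender side: build the destination address from both identifiers.
    return main_id * GROUP_SIZE + sub_id


def to_sub_id(address):
    # Receiver side: recover the sub-identifier from the address that was read.
    return address % GROUP_SIZE


memory = {}  # address -> data; presence of a key plays the role of a valid tag


def store(main_id, sub_id, data):
    memory[to_address(main_id, sub_id)] = data


def fetch_from_group(main_id):
    """Return (sub_id, data) for some valid word of the group, or None."""
    base = to_address(main_id, 0)
    for address in range(base, base + GROUP_SIZE):
        if address in memory:
            return to_sub_id(address), memory.pop(address)
    return None


store(main_id=2, sub_id=3, data="x")
result = fetch_from_group(2)
```

The receiver designates only the main identifier and learns which member of the group it obtained from the address at which the data was found.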

Moreover, to achieve the above-described objects of the present
invention, the mutually related data group can be received in the order
of the data arrivals, independently of the execution of the instructions
by the processor of the data reception side. In addition, since whether
or not all of the mutually related data have arrived is confirmed by the
number of data receptions, the above-described objects of the invention
can be achieved.

In the first preferred embodiment, when one processor element writes the
data into a word within the local memory of another processor element,
the content of the tag attached to the word is set by the
above-described validating means so as to represent that "the data is
valid". Similarly, when one processor element according to the second
preferred embodiment writes the data into a word in the reception memory
of another processor element, the content of the tag attached to the
word is set by the validating means so as to indicate that "the data is
valid". The processor element which reads out this data, on the other
hand, waits, by means of the above-described tag accessing means, until
the content of the tag attached to the word represents that "the data is
valid", and then reads out the data stored in the word. Thereafter, the
content of the tag is set by the above-described invalidating means so
as to indicate that "the data is invalid".

With the above-described circuit arrangement, 
the data transfer between the processor elements 
can be correctly and efficiently performed. 

Furthermore, by employing both the main identifier and sub-identifier,
the following advantage of the preferred embodiment can be obtained
when, for instance, the commutative law (exchange rule) holds for the
process of the reception processor. That is, the address of the data
required for performing this process is generated by utilizing the main
identifier, and the data which has arrived at the local memory is
fetched together with the sub-identifier in the order of the data
arrivals, so that the idle time of the reception processor can be
minimized.

In addition, the data reception in the processor is performed
independently of the instruction execution of the processor, with the
result that the number of instructions executed for receiving the data
can be reduced. Also, since the completion of the reception of a
plurality of data is confirmed by the number of the received data, the
number of instructions executed for the data reception confirmation can
be reduced, as compared with the case where the data reception
confirmation is carried out for each piece of data.
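
The count-based confirmation can be illustrated with a small Python sketch. It is an assumption-laden model, not the counter circuit of the later embodiments: `ReceptionCounter`, `on_receive` and `all_arrived` are hypothetical names.

```python
class ReceptionCounter:
    """Toy model: completion is judged by how many items have arrived."""

    def __init__(self, expected):
        self.expected = expected
        self.received = 0

    def on_receive(self):
        # Called by the receiving side for each arriving item, without
        # involving the processor's instruction stream.
        self.received += 1

    def all_arrived(self):
        # A single comparison replaces one validity check per item.
        return self.received >= self.expected


counter = ReceptionCounter(expected=3)
checks = []
for _ in range(3):
    counter.on_receive()
    checks.append(counter.all_arrived())
```

The program only has to test one condition for the whole group instead of inspecting every item separately.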



Brief Description of the Drawings 

For a better understanding of the invention, 
reference is made to the following descriptions in 
conjunction with the accompanying drawings, in 
which: 

Fig. 1 is a schematic block diagram of an overall circuit arrangement
of a parallel computer according to a first preferred embodiment of the
invention;

Fig. 2 is a detailed circuit diagram of the processor 3 employed in the
parallel computer shown in Fig. 1;

Fig. 3 is a schematic circuit diagram of the memory access circuit 38
employed in the processor shown in Fig. 2;

Fig. 4 illustrates a format of a RECEIVE instruction executed in the
processor shown in Fig. 1;

Fig. 5 illustrates a format of a SEND instruction executed in the
processor shown in Fig. 1;

Fig. 6 is a schematic block diagram of an 
overall circuit arrangement of a parallel computer 
according to a second preferred embodiment of the 
invention; 

Fig. 7 is a detailed circuit diagram of the 
processor 3A employed in the parallel computer 
shown in Fig. 6; 

Fig. 8 is a schematic block diagram of a 
parallel computer according to a third preferred 
embodiment of the invention; 

Fig. 9 is a detailed circuit diagram of the 
processor employed in the parallel computer 
shown in Fig. 8; 

Fig. 10 is a schematic block diagram of the 
address converting circuit employed in the parallel 
computer shown in Fig. 8; 

Fig. 11 is a schematic block diagram of 
another address converting circuit (110) employed 
in the parallel computer according to the third 
preferred embodiment; 



Fig. 12 shows a relationship between two memory spaces employed in the
parallel computer according to the third preferred embodiment;

Fig. 13 is a schematic block diagram of an overall circuit arrangement
of another parallel computer according to a third preferred embodiment;

Fig. 14 is a schematic block diagram of the address converting circuit
(104) employed in the parallel computer according to the fourth
preferred embodiment;

Fig. 15 is a schematic block diagram of another address converting
circuit employed in the parallel computer according to the fourth
preferred embodiment;

Fig. 16 illustrates a relationship between two memory spaces employed in
the parallel computer according to the fourth preferred embodiment;

Fig. 17 is a schematic block diagram of an overall circuit arrangement
of a parallel computer according to a fifth preferred embodiment of the
invention;

Fig. 18 is a detailed circuit diagram of the processor (3d) employed in
the parallel computer according to the fifth preferred embodiment;

Fig. 19 is a circuit diagram of the address generating circuit (1306)
employed in the parallel computer according to the fifth preferred
embodiment;

Fig. 20 is a circuit diagram of the memory access circuit (1302)
employed in the processor shown in Fig. 18;

Fig. 21 shows an instruction format of a V RECEIVE instruction employed
in the parallel computer according to the fifth preferred embodiment;

Fig. 22 illustrates a format of a V SEND instruction employed in the
parallel computer according to the fifth preferred embodiment;

Fig. 23A represents a format of a VSENDL instruction;

Fig. 23B indicates a format of a VRECEIVEL instruction;

Fig. 24 is a schematic diagram of a parallel processor according to a
sixth preferred embodiment of the invention;

Fig. 25 is a detailed circuit arrangement of the address generating unit
shown in Fig. 24;

Fig. 26 is a detailed circuit arrangement of the reception control shown
in Fig. 24;

Fig. 27 is a detailed circuit arrangement of the vector process unit
shown in Fig. 24;

Fig. 28 is a schematic block diagram of a parallel processor according
to a seventh preferred embodiment of the invention;

Fig. 29 is a schematic block diagram of a parallel processor according
to an eighth preferred embodiment of the present invention;

Fig. 30 is a detailed circuit arrangement of the address generating unit
shown in Fig. 29;

Fig. 31 is a detailed circuit arrangement of the data reception shown in
Fig. 29;

Fig. 32 is a schematic block diagram of a 
parallel processor according to a ninth preferred 
embodiment of the invention; and, 

Fig. 33 is a detailed circuit arrangement of 
the counter circuit shown in Fig. 32. 



Detailed Description of the Preferred Embodiments 



PARALLEL COMPUTER ACCORDING TO A FIRST 
PREFERRED EMBODIMENT 

Fig. 1 is a schematic block diagram of a parallel computer according to
a first preferred embodiment of the invention. In Fig. 1, reference
numerals 1-1 to 1-n indicate "n" processor elements which can be
operated independently. Reference numeral 2 denotes a network which
receives a data transmission instruction produced from an arbitrary
processor element selected from the processor elements 1-1 through 1-n,
and transfers the data to the designated processor element.

Next, the circuit arrangements of the processor elements 1-1 to 1-n will
be described. Since these processor elements 1-1 to 1-n have the same
circuit arrangement, only the internal circuit arrangement of the
processor element 1-1 is shown in Fig. 1 for the sake of simplicity. The
processor element 1-1 is constructed of a processor 3, a receive unit 4,
a send unit 5 and a local memory 6. According to the feature of the
first preferred embodiment, the local memory 6 is employed so as to
store the program and also the received data (reception data).

More specifically, the feature of the parallel 
computer shown in Fig. 1 is that when the received 
data is stored into the local memory, the tag repre- 
sentative of the valid data is also stored, and the 
processor 3 judges whether or not the desired data 
has been stored into the local memory 6 based 
upon the value of the tag attached to this data. 

A detailed circuit diagram of the processor 3 is illustrated in Fig. 2.
In Fig. 2, reference numeral 30 denotes an instruction fetch circuit,
reference numeral 32 indicates an ALU, reference numeral 33 represents a
tag invalidating circuit, reference numeral 34 is a general-purpose
register group, and reference numeral 35 indicates a program counter.
Also, reference numeral 36 denotes an instruction register subdivided
into a field 36-1 for storing the instruction code, and three fields
36-2, 36-3, and 36-4 for storing operands. Reference numeral 37
indicates an instruction decoding controller for controlling the
decoding of the instruction and the execution thereof. Reference numeral
38 indicates a memory access circuit which is used to execute a memory
access instruction such as a RECEIVE instruction (to be discussed
later). Although the processor 3 is a so-called "von Neumann type
computer", this processor 3 can execute two newly provided instructions
in addition to the normal instruction set of the von Neumann type
computer (memory read, memory write, calculation instructions, etc.).

In the local memory 6 shown in Fig. 1, an address is assigned to each
word so as to store data having one word length. The local memory 6 is
constructed of a plurality of memory areas. Each of the memory areas
consists of a data unit for storing data having a unit length of one
so-called "word", and additionally a tag unit for storing a 1-bit tag.
The tag of each memory area indicates whether or not valid data has been
written into that memory area, and accordingly has the value 1 or the
value 0. Both the program to be executed by the processor 3 and the data
used in this program are stored in the local memory 6. The local memory
6 is provided with: a first input port in which a line L3 is an input
line for an address, a line L4 is an input line for the data and a
demand signal of a data write instruction, and a line L5 is an input
line for the tag and a demand signal of a tag write instruction; a
second input port in which a line L6 is an input line for the address,
and a line L7 is an input line for the tag and a demand signal of a tag
write instruction; a third input port in which a line L8 is an input
line for the address and a demand signal of a tag and data readout
instruction, and a line L9 is an input line for a demand signal; and a
fourth input port in which a line L12 is an input line for the address
and a demand signal of a data readout instruction, and a line L13 is an
output line for the data. In case that two or more demands from the
first port, second port and third port arrive simultaneously, the local
memory 6 arbitrates these demands and responds to them in turn.

The send unit 5 transmits a packet containing the data to be written
into the local memory of another processor element when this data is
sent from its own processor element. This send unit 5 includes a send
register 50 therein. The send register 50 is divided into fields 50-1,
50-2 and 50-3. These fields are employed so as to store a destination
(processor element number), an address used as the data identifier, and
data, respectively. It should be noted that the address indicates the
location at which the data is to be stored in the local memory within
the destination processor element.

The receive unit 4 is an apparatus for receiving the data as well as the
address transferred via the network 2 and line L1 from another processor
element, and for writing this received data into the memory area of the
local memory 6 designated by the received address. The receive unit 4
includes a receive register 40 therein. This receive register 40 is
divided into two fields 40-1 and 40-2, which store the address and the
data sent from the network 2, respectively. Reference numeral 41
indicates a write controller which outputs both the tag having a value
of 1 and the demand signal of a tag write instruction to the line L5
when the data is written from the network 2 into the receive register
40.
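
The behaviour of the receive unit with respect to the tagged local memory can be sketched in Python as follows. This is only a software analogy under stated assumptions: `LocalMemory`, `ReceiveUnit` and `on_packet` are hypothetical names, and signal lines and port arbitration are not modelled.

```python
class LocalMemory:
    """Words with a data part and a 1-bit tag part (1 = valid data written)."""

    def __init__(self, size):
        self.data = [0] * size
        self.tag = [0] * size


class ReceiveUnit:
    """Toy model of receive unit 4 writing into the local memory."""

    def __init__(self, memory):
        self.memory = memory
        self.register = None  # models receive register 40: (address, data)

    def on_packet(self, address, data):
        self.register = (address, data)   # fields 40-1 and 40-2
        self.memory.data[address] = data  # data write at the received address
        self.memory.tag[address] = 1      # role of write controller 41


mem = LocalMemory(8)
unit = ReceiveUnit(mem)
unit.on_packet(address=5, data=123)
```

Only the word named by the received address changes state; the tags of all other words remain 0.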

In the processor 3 within the respective processor elements, the
instruction fetch circuit 30 outputs the content of the program counter
(PC) 35 and the demand signal of a data readout instruction via the line
L12 to the fourth input port of the local memory 6. The instruction
designated by the program is read out from the local memory 6 and set
via the line L13 into the instruction register 36. The instruction
decoding controller 37 decodes the instruction code stored in the field
36-1 of the instruction which has been set into the instruction register
36, and distributes the signals for executing the operation designated
by this instruction to the internal circuits of the processor 3, in
order to operate the ALU 32, general-purpose register group 34 and so
on. When the operation designated by the instruction is accomplished,
the instruction decoding controller 37 updates the value of the program
counter 35 via the line L302. The packet to be sent by the respective
processor elements is produced by this processor 3 and then sent to the
send unit 5. Also, the data received from another processor element into
the local memory 6 is processed by this processor 3. The above-described
series of operations is repeatedly performed.

When the packet is set into the send register 50
within the send unit 5, the network 2 transfers the
address and data stored in the fields 50-2 and 50-3 to
the receive unit 4 provided in the destination
processor element 1-i (i = 1, 2, ..., or n) which is
designated by the field 50-1.

A detailed operation of the parallel computer will now
be described with reference to the newly introduced
"SEND instruction" and "RECEIVE instruction". The SEND
instruction is an instruction by which data held in
the processor element executing this instruction is
written into the local memory of another processor
element. Fig. 4 illustrates a format of the SEND
instruction. This SEND instruction has the following
three operands.

1. destination (destination processor element number)

2. local memory address (address of the send data
storing area within the local memory of the
destination processor element)

3. send data

Each of these operands is stored in the general-purpose
register designated by the corresponding one of the
R1, R2 and R3 fields of the instruction. This
instruction implies that the data designated by the
third operand is written into the address, designated
by the second operand, of the local memory of the
processor element designated by the first operand.
When this instruction is executed, the parallel
computer according to the first preferred embodiment
operates as follows.

First, when this SEND instruction is set in the
instruction register 36, the instruction decoding
controller 37 of the processor 3 transfers the
register numbers R1 to R3 specified by the instruction
via the lines L305 and L307 to the general-purpose
register group 34. As a result, the destination,
address and data are read out via the line L309, and
set as a packet in the fields 50-1, 50-2 and 50-3 of
the send register 50 via the line L11 together with
the write demand signal "W" generated by the
instruction decoding controller 37. The operations of
the SEND instruction are completed in the
above-described manner. The subsequent operation is
carried out as follows.

When the packet is set in the send register 50, the
contents of the fields 50-2 and 50-3 are set in the
receive register 40 in the receive unit 4 of the
processor element designated by the field 50-1 of this
packet by the operation of the send unit 5 and the
network 2, as previously described. As a result, the
write controller 41 of the receive unit 4 of the
processor element into which the data has been set
generates the tag "T" having a value of "1" and the
write signal, writes the data of the field 40-2 via
the first port of the local memory 6 into the data
unit of the memory area designated by the address in
the field 40-1, and finally writes the value of 1
(i.e., the value indicating that the content of this
word is valid) into the tag unit of this word.

As previously described, according to the first
preferred embodiment of the invention, when data is
sent from one processor element to another, it is
determined in advance that this data is to be stored
in a predetermined memory area of the local memory of
the receiving processor element, and the data received
by the receive unit 4 is written directly into the
local memory. As a consequence, while the data is
transferred from one processor element to the other,
no interrupt operation is performed in the processor
element at the data reception end.
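
The SEND path described above can be sketched as follows. This is
an illustrative software model only, not the circuit of the
embodiment: the class names, the dictionary used as the network,
and the interrupts counter are assumptions introduced to show
that the receive unit writes the data and sets the tag without
involving the receiving processor.

```python
class LocalMemory:
    """Word-addressed memory in which every word carries a 1-bit tag."""

    def __init__(self, words):
        self.data = [0] * words
        self.tag = [0] * words  # 0 = invalid, 1 = valid

    def write_with_tag(self, address, value):
        # Corresponds to the first port: data write plus tag write of 1.
        self.data[address] = value
        self.tag[address] = 1


class ProcessorElement:
    def __init__(self, number, words=16):
        self.number = number
        self.memory = LocalMemory(words)
        self.interrupts = 0  # never incremented: reception needs no interrupt

    def receive_unit(self, address, value):
        # Models the receive unit 4 with its write controller 41.
        self.memory.write_with_tag(address, value)


def send(network, destination, address, value):
    """Model of SEND: route (address, value) to the destination element."""
    network[destination].receive_unit(address, value)


network = {i: ProcessorElement(i) for i in (1, 2)}
send(network, destination=2, address=5, value=99)
```

After the call, word 5 of element 2 holds the value with its tag
set to 1, while element 1 is untouched.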

Next, the newly provided "RECEIVE instruction" will be
described. This RECEIVE instruction is an instruction
to read data from the local memory 6 of the processor
element where this RECEIVE instruction is executed.
Fig. 5 illustrates a format of the RECEIVE
instruction. The RECEIVE instruction has two operands.

1. local memory address used as data identifier
(address of the storage position of the received data
to be read from the local memory)

2. register number (number of the general-purpose
register into which this data should be stored)

Each of these operands is stored in the general-purpose
register designated by the R1 and R2 fields of this
instruction. This instruction implies that the valid
data is read out from the address of the local memory
designated by the first operand, and this read data is
stored into the general-purpose register designated by
the second operand. When this instruction is executed,
the parallel computer according to the preferred
embodiment operates as follows.

First, when this instruction is set into the
instruction register 36, the instruction decoding
controller 37 of the processor 3 outputs the register
numbers R1 and R2 designated by this instruction via
the respective lines L305 and L306 to the
general-purpose register group 34, and also outputs
the signal representative of the execution of the
RECEIVE instruction to the line L303 so as to start
the memory access circuit 38, while the address of the
local memory 6 is read out from the general-purpose
register numbered "R1" within the general-purpose
register group 34.

As shown in Fig. 3, in the memory access circuit 38,
the flip-flop 399 is set by this signal on the line
L303. Reference numeral 401 denotes a circuit which
repeatedly generates a readout demand in synchronism
with the operation of the local memory 6. When the
flip-flop 399 is set, the AND gate 400 is opened and
the readout demand R is output from the circuit 401 to
the line L301. The address and the readout demand
signal are transferred via the line L8 to the third
port of the local memory 6.

As a consequence, in response to this readout demand
signal the local memory 6 reads both the data and the
tag from the memory area designated by this address,
and outputs them to the lines L10 and L9,
respectively. The received data which has been output
to the line L10 is set into the general-purpose
register designated by the second operand. The tag T
which has been read onto the line L9 is input into the
memory access circuit 38. When its value is equal to
"0", the flip-flop 399 (see Fig. 3) in the memory
access circuit 38 is not reset, and the memory readout
demand signal is again generated from the readout
demand generating circuit 401 (see Fig. 3) to the line
L301 so as to repeat the above-described memory
access. When the tag output on the line L9 is equal to
"1", since the value of "1" indicative of the
instruction decoding operation has been input from the
line L303 to the AND circuit 300 within the
invalidating circuit 33, the write demand is output
from the AND circuit 300. On the line L7, the value of
"0", which is to be written as the invalid tag, is
continuously output. The address of the local memory
which has been output on the line L6, the invalid tag
value of "0" on the line L7, and the write demand sent
from the AND circuit 300 are supplied to the second
input port, whereby the tag of the memory area
designated by this address becomes "0" (the value
indicating that the data is invalid). Also, when the
tag on the line L9 becomes 1, the flip-flop 399
(Fig. 3) is reset, and the tag T is supplied via the
line L302 to the program counter 35 so as to update
the content of the program counter 35 in conjunction
with the above-described operation in the memory
access circuit 38. With the above-described operation,
the RECEIVE instruction is accomplished.
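
The RECEIVE behaviour described above (repeat the read while the
tag is 0, then invalidate the tag once valid data has been read)
can be sketched in software as follows. This is only an
illustration under stated assumptions: the class and function
names, the on_idle callback that stands in for the
parallel-operating receive unit, and the max_polls limit are all
hypothetical; the circuit itself repeats the access without any
such limit.

```python
class TaggedMemory:
    def __init__(self, words):
        self.data = [0] * words
        self.tag = [0] * words  # 0 = invalid, 1 = valid

    def read(self, address):
        # Third port: tag (line L9) and data (line L10) are read together.
        return self.tag[address], self.data[address]

    def write_with_tag(self, address, value):
        # First port: used by the receive unit.
        self.data[address] = value
        self.tag[address] = 1

    def invalidate(self, address):
        # Second port: the invalidating circuit writes tag value 0.
        self.tag[address] = 0


def receive(memory, address, on_idle=None, max_polls=1000):
    """Poll until the tag is 1, then clear it and return (value, polls)."""
    for polls in range(1, max_polls + 1):
        tag, value = memory.read(address)
        if tag == 1:
            memory.invalidate(address)
            return value, polls
        if on_idle is not None:
            on_idle()  # lets the simulated network make progress
    raise TimeoutError("no valid data arrived (limit is an assumption)")


memory = TaggedMemory(8)
countdown = [3]


def network_tick():
    # Simulated receive unit: the data arrives after three idle polls.
    countdown[0] -= 1
    if countdown[0] == 0:
        memory.write_with_tag(2, 77)


value, polls = receive(memory, 2, on_idle=network_tick)
```

Because the tag is cleared as part of the same instruction, a
second RECEIVE on the same address would wait for fresh data
rather than re-read the old word.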

Through the above-described operations, the received
data which has been read into the general-purpose
register is processed by the processor 3. Using, for
instance, a calculation instruction, a calculation is
executed in the calculator 32 with respect to this
data. When either the result of this calculation or
the original received data is written into the local
memory 6, the known memory store instruction is
performed. The instruction decoding controller 37 is
so designed as to generate the memory write signal on
the line L308 when decoding this instruction. At this
time, the data to be written is read from the first
general-purpose register designated by this
instruction onto the line L315, and the memory address
to be written is output on the line L10 from the
second general-purpose register. The above-described
write signal, address, and data are sent via the line
L10 to the local memory 6, and written therein.
The memory read instruction in the program is also
performed in a manner similar to that of the
conventional parallel computer. When the instruction
decoding controller 37 decodes this instruction, the
read demand is output to the line L304. Meanwhile, the
memory address is output on the line L8 from the first
general-purpose register designated by this
instruction. The read demand is transferred together
with this address to the local memory 6 via the OR
gate 31. When the data is read out from this local
memory, the read data is stored in the second
general-purpose register designated by this
instruction. As described above, the tag is ignored
during both the normal memory store instruction and
the memory read instruction.

As previously described, when data must be transmitted
from one processor element to another in the parallel
computer according to the first preferred embodiment,
the memory area used for the data transmission within
the local memory of the processor element at the data
reception side is predetermined by either a programmer
or a compiler. The processor element at the data
transmission side is programmed to send the data to
the local memory of the processor element at the data
reception side by the above-described SEND
instruction, and the processor element at the data
reception side is programmed to read this data by the
RECEIVE instruction. With this arrangement, the
processor element at the data reception side never
reads the data before the processor element at the
data transmission side writes the data, so that the
reference order of the data can be assured.

As previously described, in the parallel com- 
puter according to the first preferred embodiment, 
when the data is sent from one processor element 
to the other processor element, no interruption op- 
eration is required in the other processor element. 
Since the local memory 6 for storing the program 
is used for temporarily storing the received data, 
no specific memory such as an associative mem- 
ory is needed. 

In the parallel computer according to the first
preferred embodiment of the invention, when the
RECEIVE instruction was executed and the tag value was
equal to "0", the memory access circuit repeatedly
accessed the memory so as to check the value of the
tag. Alternatively, the RECEIVE instruction may read
the value of the tag once and then complete its
execution, setting the result into either a flag
register or a condition code register, and the RECEIVE
instruction may be repeatedly performed until the tag
value becomes 1 by means of a conditional branch
instruction subsequent to the RECEIVE instruction.
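
The alternative just described, in which RECEIVE completes after
a single tag read and a following conditional branch repeats it,
can be sketched as below. The function name try_receive, the
encoding of the condition code as 0 or 1, and the arrivals table
that simulates the sender are assumptions for illustration. The
sketch also assumes that a successful one-shot RECEIVE clears the
tag as in the main embodiment, which the text does not state for
this variant.

```python
def try_receive(data, tag, address):
    """One-shot RECEIVE: return (condition_code, value).

    condition_code is 1 when valid data was read, 0 otherwise.
    Clearing the tag on success is an assumption carried over
    from the main embodiment.
    """
    if tag[address] == 1:
        tag[address] = 0
        return 1, data[address]
    return 0, None


data = [0] * 4
tag = [0] * 4
arrivals = {2: (1, 55)}  # at step 2, value 55 arrives at address 1

step = 0
while True:
    if step in arrivals:
        addr, val = arrivals[step]
        data[addr] = val
        tag[addr] = 1
    cc, value = try_receive(data, tag, 1)
    if cc == 1:
        break      # branch not taken: fall through to the next instruction
    step += 1      # branch taken: execute RECEIVE again
```

Compared with the hardware polling loop, this form leaves the
processor free to do other work between attempts, at the cost of
an extra branch instruction in the program.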



PARALLEL COMPUTER ACCORDING TO SECOND PREFERRED
EMBODIMENT

Fig. 6 illustrates a parallel computer according to
the second preferred embodiment. Compared with the
first preferred embodiment, the second preferred
embodiment is characterized in that a receive memory 8
for temporarily storing the received data is provided
separately, in addition to the local memory 6A. The
same reference numerals employed in Fig. 6 indicate
the same or similar circuit arrangements shown in
Fig. 1.

In Fig. 6, although the local memory 6A for storing
the program and the data used therein is similar to
the local memory 6 shown in Fig. 1, it has no area for
storing the received data. As a result, no tag area is
provided for this purpose, which is different from the
local memory 6 shown in Fig. 1. A detailed circuit of
the processor 3A is shown in Fig. 7. The function of
the processor 3A is different from that of the
processor 3 shown in Fig. 1 only in that the received
data is read from the receive memory 8.

In the circuit of Fig. 6, the receive memory 8 is
constructed of a plurality of memory areas, each
storing data of one word length, to which addresses
are assigned in units of one word. Each memory area
has a data unit for storing the data and, in addition,
a tag unit for storing a one-bit tag. The receive
memory 8 includes a first port for setting a line L3
to an address input line, a line L4 to an input line
of the data and the demand signal for the data write
instruction, and a line L5 to an input line of the tag
and the demand signal for the tag write instruction; a
second port for setting a line L6 to an address input
line, and a line L7 to an input line of the tag and
the demand signal for the tag write instruction; and a
third port for setting a line L8 to an input line of
the address and the demand signal for the tag and data
read instruction, a line L9 to a tag output line, and
a line L10 to an output line of the data. When two or
more demands arrive from the respective ports, the
receive memory 8 arbitrates among these demands and
responds to them accordingly. The local memory 6A
includes a first port for setting a line L15 to an
input line of the address and the demand signal for
the data read instruction, and a line L14 to an input
line of the demand signal for the data write
instruction and of the data; and a second port for
setting a line L12 to an input line of the address and
the demand signal for the data read instruction, and a
line L13 to a data output line. When two or more
inputs arrive from these ports, the local memory 6A
arbitrates among these inputs and responds to them
accordingly.

The functions of this receive unit 4 are similar to
those of the receive unit 4 shown in Fig. 1 except
that the received data is written not into the local
memory 6A but into the receive memory 8.

The operations of the respective processor elements
1-1 to 1-n are the same as those of the processor
elements shown in Fig. 1 except for the following
operations.

A SEND instruction employed in the parallel computer
according to the second preferred embodiment has the
same format (Fig. 4) as the SEND instruction according
to the first preferred embodiment. However, the
general-purpose register of the register number R2
holds in advance the address of the receive memory 8
of the destination processor element, not the local
memory address of that processor element. This address
indicates an area into which the transmission data
should be stored.

The write controller 41 of the receive unit 4
generates a tag having a value of 1 and a write
demand. The tag, the write demand, and the address
40-1 within the receive register 40 are sent to the
first port of the receive memory 8, where the tag and
the received data are written at the position
designated by this address.

Although the RECEIVE instruction employed in the
second preferred embodiment has the same format as
that of the first preferred embodiment, it differs in
that the address held in the general-purpose register
indicated by the register number R1 is the address of
a memory position within the receive memory 8. When
this instruction is performed, the operation of the
parallel computer according to the second preferred
embodiment differs from that of Fig. 1 in that the
data readout operation is initiated with regard to the
receive memory 8.

That is to say, the instruction decoding controller 37
of the processor 3A outputs the contents (register
numbers) of the fields 36-2 and 36-3 of the
instruction register 36 to the general-purpose
register group 34 via the respective lines L305 and
L306, and also outputs via the line L303 a signal
indicating that the RECEIVE instruction is being
executed so as to start the memory access circuit 38.
As a result, the memory read demand signal generated
by the memory access circuit 38 is output via the line
L301, together with the first operand as the memory
address, to the third port of the receive memory 8.
Accordingly, the receive memory 8 outputs the value of
the tag to the line L9 and the data to the line L10.
The data output to the line L10 is set in the
general-purpose register designated by the second
operand. The value of the tag output to the line L9 is
input into the memory access circuit 38. When the
value of this tag is equal to "0", the memory access
circuit 38 again generates the memory read demand
signal to the line L301, whereby the above-described
memory access operation is repeated. When the value of
the tag output to the line L9 is equal to "1", the
output of the AND circuit 300 in the invalidating
circuit 33 becomes "1". As a consequence, the first
operand output to the line L6 is used as the address,
and both the value "0" output to the line L7 and the
output of the AND circuit 300 are supplied to the
second port of the receive memory 8 as the tag write
data and the tag write demand signal, so that the tag
of the word designated by the first operand becomes
"0" (the value indicating that the data is invalid).
When the tag on the line L9 is equal to "1", the
memory access circuit 38 also supplies a signal to the
program counter 35 via the line L302, so as to update
the content of the program counter 35. With the
above-described series of operations, the RECEIVE
instruction is completed.

It should be noted that the normal memory read
instruction and memory write instruction are executed
with respect to the local memory 6A via the line L14
or L15.

The parallel computer according to the second
preferred embodiment of the invention operates
similarly to that of the first preferred embodiment
except that the memory positions used for transferring
the data between the processor elements are in the
receive memory 8. Therefore, the same advantages as
those of the parallel computer according to the first
preferred embodiment can be achieved in the second
preferred embodiment.

In addition to the above-described advantages, the
second preferred embodiment has the following specific
effect. Since the memory (receive memory 8) used for
receiving the data is separated from the memory (local
memory 6A) for storing the program and data, there
exist two distinct memory spaces, namely the memory
space of the receive memory 8 and the memory space of
the local memory 6A.

As a result, it is easy to produce the programs
executed in each of the processor elements. That is to
say, the amounts of data and program stored in the
local memory differ from one processor element to
another. Consequently, when the data is written
directly into the local memory of the destination
processor element as in the first preferred
embodiment, it must be determined which empty area of
the local memory in that processor element is to be
used, and the program is produced on the basis of this
determination. When a large quantity of data is
transferred between the processor elements, this
determination becomes very complex. However, when the
transmission data is written into the receive memory,
which is separate from the local memory, the position
in the receive memory into which the data is written
can be determined irrespective of how the local memory
is used. As a consequence, the program can be easily
produced. More specifically, when the program executed
in this parallel computer is linkage-edited and the
unresolved external references are resolved, both 1)
the resolution of the unresolved external references
between the receive memories of the respective
processor elements, and 2) the resolution of the
unresolved external references within the respective
processor elements can be executed independently.
Then, if a program which proceeds with a calculation
while data is exchanged between the processor elements
is once produced as a subroutine, and the unresolved
external references of 1) above are resolved, the
unresolved external references of 2) above can still
be resolved when this subroutine is incorporated into
another program. In other words, the program can be
readily reused.



PARALLEL COMPUTER ACCORDING TO A 
THIRD PREFERRED EMBODIMENT 

Fig. 8 shows a parallel computer according to a third
preferred embodiment of the invention. The parallel
computer according to the third preferred embodiment
is a modification of the parallel computer of the
first preferred embodiment. The same reference
numerals employed in Fig. 8 denote the same circuit
elements shown in Fig. 1. A difference between the
parallel computers shown in Figs. 1 and 8 is as
follows. In the receive unit 4A of each processor
element, an address transform unit 100 is employed for
transforming a virtual receive memory address 40-1,
serving as the data identifier, into the address in
the local memory 6A, and another address transform
unit 110 is employed for transforming a virtual
receive memory address, serving as the data
identifier, into the address of the local memory.

According to the third preferred embodiment, the
advantage achieved in the second preferred embodiment
can be realized by slightly modifying the parallel
computer of the first preferred embodiment. The
specific feature of the second preferred embodiment
results from separating the memory space of the
receive memory 8 (Fig. 6) from the memory space of the
local memory 6A. To separate the spaces, the memory
itself may be divided as described in the second
preferred embodiment. However, the space separation
can also be achieved by the different method of the
third preferred embodiment. That is to say, as
illustrated in Fig. 12, a memory space 81 of a virtual
receive memory is mapped into the receive memory area
92 of the memory space 91 of the local memory 6A. When
the receive memory area 92 occupies the range from the
address x to (x + n-1), the received address is
transformed into the address larger than it by "x". In
terms of the computer architecture, the separate
spaces are realized on the same memory hardware.
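
The mapping described above amounts to adding the constant x to
the virtual receive memory address. The following sketch shows
that arithmetic; the function name and the example values of x
and n are assumptions chosen for illustration.

```python
def to_local_address(virtual_address, base_x):
    """Map a virtual receive memory address into the local memory.

    Virtual address v is placed at local address base_x + v, so a
    receive memory area of n words occupies base_x .. base_x + n - 1.
    """
    return base_x + virtual_address


base_x = 0x100  # hypothetical start of the receive memory area
n = 16          # hypothetical size of the area in words

first = to_local_address(0, base_x)
last = to_local_address(n - 1, base_x)
```

With these example values the first and last words of the virtual
receive memory land at local addresses 0x100 and 0x10F.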

The address transform unit 100 is constructed of, as
shown in Fig. 10, an adder 102 for adding the received
address 40-1 to the above-described constant value x.
The value of x may be set in the register 101 by, for
instance, the processor 3B shown in Fig. 8 (a line
required for this purpose is omitted in the drawing).
As illustrated in Fig. 11, the address transform unit
110 is constructed of an adder 114 for adding the
above-described constant value "x" to the address on
the line L313 read from the general-purpose register
134 (to be discussed later). The value may be set in
the register 113 in a manner similar to that of the
register 101.

Fig. 9 illustrates a circuit arrangement of a
processor 3B. In Fig. 9, the same reference numerals
as those in Fig. 2 indicate the same elements. The
processor shown in Fig. 9 differs from that of Fig. 2
in that, in addition to the unit 110, a selector 112
is employed so as to select between the output L313 of
the general-purpose register and the output of the
address transform unit 110, and in that the
instruction decoding controller 37A outputs a signal
L312 for switching the selector 112 during the RECEIVE
instruction decoding operation, in addition to the
functions of the instruction decoding controller 37
shown in Fig. 2.

When this signal L312 is not output, the selector 112
continuously connects the line L313 to the line L314.

The operation of the processor shown in Fig. 8 differs
from that shown in Fig. 1 in the following point. When
the SEND instruction (Fig. 4) is executed, the virtual
receive memory address functioning as the data
identifier, which has been determined with respect to
the data to be sent, is stored in advance in the
general-purpose register having the register number
R2. The data transmission operation after the SEND
instruction is accomplished is the same as that of
Fig. 1.

In the receive unit 4A to which the data has been
transmitted, the received virtual receive memory
address 40-1 is transformed into the local memory
address by the address transform unit 100. The
received data together with the tag having a value of
1 is written into the local memory 6 by utilizing this
transformed address. After the SEND instruction is
completed, the virtual receive memory address is held
in the general-purpose register of the register number
R1 designated by the RECEIVE instruction (Fig. 5).
When this instruction is performed, as explained with
reference to Fig. 1, this address is output on the
line L313. In the third preferred embodiment, this
address is transformed into the corresponding local
memory address by the address transform unit 110. The
selector 112 selects the transformed address in
response to the signal on the line L312, and then
sends it to the local memory 6. The subsequent
operation is the same as that shown in Fig. 1.

Next, the newly introduced RECEIVE instruction will be
described. This RECEIVE instruction has the same
format as that of the first preferred embodiment, and
similarly has the same operands. When this instruction
is executed, the parallel computer according to the
third preferred embodiment operates as follows.

That is to say, the instruction decoding controller 37
of the processor 11 first outputs the contents of the
fields 36-2 and 36-3 of the instruction register 36 to
the general-purpose register group 34 via the
respective lines L305 and L306, and also outputs via
the line L303 a signal indicating that the RECEIVE
instruction is being executed so as to start the
memory access circuit 38. As a result, the first
operand as the memory address is output via the line
L313, the address transform unit 110 and the selector
112 to the line L314. The memory read demand signal
generated by the memory access circuit 38 is sent via
the line L301 and the OR circuit 31, and is output
together with this address to the third port of the
local memory 6. Accordingly, the local memory 6
outputs the value of the tag to the line L9 and the
data to the line L10. The data output to the line L10
is set in the general-purpose register designated by
the second operand. The value of the tag output to the
line L9 is input into the memory access circuit 38.
When the value of this tag is equal to "0", the memory
access circuit 38 again generates the memory read
demand signal to the line L301, whereby the
above-described memory access operation is repeated.
When the value of the tag output to the line L9 is
equal to "1", the output of the AND circuit 300 in the
invalidating circuit 33 becomes "1". As a consequence,
the first operand output via the line L313 and the
address transform unit 110 to the line L6 is used as
the address for the identifier, and both the value "0"
output to the line L7 and the output of the AND
circuit 300 are supplied to the second port of the
local memory 6 as the tag write data and the tag write
demand signal, so that the tag of the word designated
by the first operand becomes "0". When the tag on the
line L9 is equal to "1", the memory access circuit 38
also supplies a signal to the program counter 35 via
the line L302, so as to update the content of the
program counter 35. With the above-described series of
operations, the RECEIVE instruction is completed.

In the above-described preferred embodiment, the
address transformation associated with the execution
of the SEND instruction was performed in the address
transform unit 100 within the receive unit 10 shown in
Fig. 8. However, this address transformation may be
performed at any point after the SEND instruction has
been executed and before the value is actually written
into the local memory. Accordingly, as shown in
Fig. 13, an address transform unit 120 may be provided
within the send unit 12, using the same circuit
arrangement as the address transform unit 110 shown in
Fig. 8.

In the above-described embodiment, the method of
adding a constant value of "x" has been described as
the mapping method. The present invention is not
limited to this mapping method. For instance, any
mapping method may be employed in which the address of
the local memory is determined uniquely from the
address of the receive memory space and no two
addresses are mapped to the same position.

Also, when data is written into the local memory of another
processor element by the SEND instruction, the local data of
that processor element is not destroyed by a program error
in the sending processor element, provided that no write is
performed to any area other than the receive memory area 92.
For instance, when the capacity of the receive memory area
92 is selected to be "n" words, the compiler may confirm
that the address in the virtual receive memory designated by
the second operand of the SEND instruction is less than "n"
when the SEND instruction is performed. If this address is
larger than "n", the data write operation is suppressed.
Alternatively, it may be confirmed, after the address
transformation by either the address transform unit 100
(Fig. 8) or the address transform unit 120 (Fig. 13), that
the transformed address is larger than the address "x" and
less than the address (x + n).
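
The two checks described above can be sketched as follows. This is a minimal Python model, not the circuit of the embodiment: the function names are hypothetical, the local memory is modelled as a plain list, and the lower bound of the transformed address is assumed to be inclusive so that the first word of the receive memory area remains usable.

```python
def relative_address_ok(rel_addr: int, n: int) -> bool:
    """Check before transformation: the address must lie in the n-word area."""
    return 0 <= rel_addr < n


def transformed_address_ok(addr: int, x: int, n: int) -> bool:
    """Check after transformation: the address must lie in [x, x + n)."""
    return x <= addr < x + n


def guarded_write(local_memory: list, rel_addr: int, value, x: int, n: int) -> bool:
    """Write value at x + rel_addr only if it falls inside the receive area.

    Returns True when the write was performed, False when it was suppressed.
    """
    if not relative_address_ok(rel_addr, n):
        return False  # suppressed: would touch memory outside the area
    addr = x + rel_addr  # the add-a-constant mapping described in the text
    local_memory[addr] = value
    return True
```

Either check alone is sufficient under the add-a-constant mapping; the sketch performs the check on the relative address because it needs no knowledge of where the area is placed.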

According to the parallel computer of the third preferred
embodiment, the same advantages as in the second preferred
embodiment are obtained, together with the following merit.
In the second preferred embodiment, the receive memory 8 and
the local memory 6A are separate in hardware. Consequently,
when a large-scale calculation is performed and the local
memory 6A runs short of capacity, the calculation cannot be
performed even if the receive memory 8 has sufficient spare
capacity. In the converse case, where the receive memory 8
runs short, such a large-scale calculation likewise cannot
be performed. According to the third preferred embodiment,
however, if there is sufficient free space in the receive
memory, this space may be utilized as part of the local
memory. In other words, if the sum of the capacity of the
receive memory 8 and that of the local memory according to
the second preferred embodiment is equal to the capacity of
the local memory 6A according to the third preferred
embodiment, the parallel computer according to the third
preferred embodiment can perform a larger-scale calculation
than that of the second preferred embodiment.

FOURTH PREFERRED EMBODIMENT

It should be noted that the parallel computer according to
the fourth preferred embodiment is constructed by modifying
that of the third preferred embodiment. In the parallel
computer according to the fourth preferred embodiment, the
local memory 6 is addressed in byte units. When the data
read from or written into the local memory 6, or the data
transferred between the processor elements, is larger than
the address unit of the local memory 6, for example 8 bytes,
a 1-bit tag is stored in the receive data storage area of
the local memory 6. Assuming that "n" pieces of receive data
are stored in the virtual receive memory area 92 (see
Fig. 16) of the local memory 6, this area 92 is divided into
"n" 8-byte areas, and one piece of received data is stored
in each 8-byte area. If the idea introduced in the parallel
computer according to the third preferred embodiment is
applied, the following operation is conceivable. When data
is transferred between the processor elements, the address
attached to the data is an address difference "d" (relative
address) between the head address "a" of the 8-byte area
within the virtual receive memory area 92 that has been
determined in advance for this data and the head address "x"
of the virtual receive memory area 92 (Fig. 16). The
processor element on the data reception side adds the
received address difference "d" to the head address "x" to
obtain the address "a" at which the data is stored. However,
if the data length is 8 bytes, the lower 3 bits of the
address difference "d" are always "0" for any data.
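
The observation that the lower 3 bits of "d" are always zero can be checked with a small sketch. The helper names and the sample value of "x" below are illustrative assumptions, not part of the embodiment.

```python
SLOT_BYTES = 8  # size of one receive data area, as stated in the text


def slot_head_address(x: int, k: int) -> int:
    """Head address "a" of the k-th 8-byte area of a receive area starting at x."""
    return x + k * SLOT_BYTES


def relative_address(a: int, x: int) -> int:
    """Address difference "d" attached to the data by the sending side."""
    return a - x


def receiver_address(d: int, x: int) -> int:
    """Address "a" recovered on the reception side by adding d to x."""
    return x + d
```

Because every head address is "x" plus a multiple of 8, the difference "d" is a multiple of 8 regardless of the value of "x", which is why its lower 3 bits carry no information.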

In the fourth preferred embodiment, the local memory of the
processor element corresponds to a memory to which addresses
are assigned in word units. Since data of 1-byte size, such
as character data, must be processed in a practical parallel
computer, addresses are in many cases assigned in byte
(8-bit) units. On the other hand, in ordinary numerical data
processing, calculations are commonly carried out in units
of 4 bytes or 8 bytes. Consequently, if the word is equal to
1 byte and a tag is attached to every byte, both the SEND
instruction and the RECEIVE instruction must each be
executed 8 times in order to assure the reference order of,
for instance, 8-byte data. This not only lengthens
instruction execution, but also requires a large quantity of
hardware for attaching the tags to the data.

For this reason, in the parallel computer according to the
fourth preferred embodiment, while data within the processor
element can be processed in 1-byte units, the lower 3 bits
of the address attached to the transferred data are not
transferred, and these lower 3 bits are regenerated in the
processor element on the data reception side. As a
consequence, the packet to be communicated is shortened and
the amount of transmitted data is reduced, which eventually
improves the data communication speed. Considering the
virtual receive memory defined by the difference address
(relative address) from which the lower 3 bits have been
deleted, it can be understood that, in the memory space 82
of the virtual receive memory, one address is assigned to
every 8 bytes of data, and that this space is mapped onto
the local memory space 91, in which one address is assigned
to every byte. If an address in the virtual receive memory
space 82 is "a" and the corresponding address in the local
memory space 91 is "b", the mapping is b = a × 8 + x. Here
the value "x" is the same as the constant value "x"
described in the third preferred embodiment, and therefore
corresponds to the least significant address of the virtual
receive memory area. In other words, the address without its
lower 3 bits corresponds to the data identifier representing
the order of the receive data storage areas in the fourth
preferred embodiment.

That is to say, when the SEND instruction is executed, the
processor element on the data send side reads from the
general-purpose register group 34 the address produced by
deleting the lower 3 bits of the relative address determined
by the data to be sent, and transfers this address in the
packet together with the data. In the address transform unit
104 of the processor element on the data reception side, the
received address is shifted left by 3 bits by a 3-bit left
shift circuit 103, so that three "0" bits are appended on
the low-order side, and the result is added to the boundary
address "x" in the register 101 by the adder 102 to produce
the local memory address. In the processor element on the
data reception side, when the RECEIVE instruction is
executed, the address generated by deleting the lower 3 bits
from the relative address is read from the general-purpose
register group 34, and the local memory address is generated
in the address transform unit 114 (Fig. 15).
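
The mapping b = a × 8 + x can be illustrated with a short sketch of the shift-and-add performed on the reception side. The function names are hypothetical; only the arithmetic is taken from the text.

```python
def pack_address(d: int) -> int:
    """Sending side: drop the lower 3 bits of the relative address d."""
    if d & 0b111:
        raise ValueError("relative address must be a multiple of 8")
    return d >> 3


def unpack_address(a: int, x: int) -> int:
    """Reception side: 3-bit left shift, then add the boundary address x."""
    return (a << 3) + x
```

For example, a relative address of 0x40 is sent as 8, and a receiver whose boundary address "x" is 0x1000 regenerates the local memory address 0x1040.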

PARALLEL COMPUTER ACCORDING TO A FIFTH
PREFERRED EMBODIMENT

A parallel computer according to a fifth preferred
embodiment is shown in Fig. 17. The parallel computer of the
fifth preferred embodiment is constructed by modifying that
of the first preferred embodiment. It should be noted that
the same reference numerals as those employed in the
parallel computer of the first preferred embodiment denote
the same or similar circuit elements in Fig. 17.

A detailed circuit of a processor 3d is shown in Fig. 18.
In the processor shown in Fig. 18, reference numeral 1307
indicates a send control circuit used for performing a
VSEND instruction (discussed later). Reference numeral 1302
denotes a memory access circuit used to execute a VRECEIVE
instruction (discussed later). Reference numerals 1308, 1309
and 1317 are selectors. Reference numeral 34 denotes a
general-purpose register group. Reference numeral 1303
indicates a vector register group. Reference numerals 1304
and 1305 are pipeline calculators.

The vector register group 1303 is constructed of a plurality
of vector registers, and each of these vector registers is
accessed successively, starting from the first element,
during data read/write operations. Reference numeral 1306 is
an address generating circuit, an internal circuit of which
is shown in Fig. 19. In Fig. 19, reference numeral 1311 is
an increment circuit, reference numeral 1313 is a decrement
circuit, reference numeral 1314 denotes a zero detector
circuit, and reference numerals 1315 and 1316 are
selectors.

Although this processor 13 corresponds to a so-called
"vector computer", it can execute newly introduced
instructions in addition to the instruction set of a normal
vector computer (memory read, memory write and calculation
instructions for scalar data, memory read, memory write and
calculation instructions for vector data, and so on). These
newly introduced instructions will be described later.

Referring now to Figs. 17 to 19, an operation of 
the processor element will be described. 

An instruction decoding controller 1301 decodes the
instruction code stored in the field 36-1 of the instruction
set in the instruction register 36, distributes signals for
realizing the operation designated by this instruction to
the internal circuits of the processor 3, and operates the
calculator 1304, calculator 1305, general-purpose register
group 34, vector register group 1303 and the like. When the
operation designated by the instruction is accomplished, the
instruction decoding controller 1301 updates the value of
the program counter 35 via the line L1308, and repeats the
above-described series of operations.

Next, the newly introduced instructions will be described.

First of all, the VSEND instruction is described. The VSEND
instruction writes vector data of the processor element that
executes the instruction into the local memory of another
processor element. Fig. 20 illustrates the format of the
VSEND instruction. The VSEND instruction has the following
three operands.

1. destination

2. base address

3. vector data.

These operands are stored, respectively, in a
general-purpose register designated by the R1 field of the
instruction format, in a general-purpose register designated
by the R2 field, and in a vector register designated by the
VR3 field. This instruction means that the vector data
designated by the third operand is written into a continuous
region, starting from the address designated by the second
operand, of the local memory of the processor element
designated by the first operand. It should be noted that the
number of elements of the vector data has been stored in
advance in a specific register of the general-purpose
register group 34 by another instruction.

When executing the VSEND instruction, the parallel computer
according to the fifth preferred embodiment operates as
follows.

First of all, the instruction decoding controller 1301 of
the processor 13 outputs the values (register numbers) of
the fields 36-2 and 36-3 of the instruction register 36 to
the general-purpose register group 34 via the respective
lines L1302 and L1303, and also outputs the value (vector
register number) stored in the field 36-4 to the vector
register group 1303 via the line L1304. The controller also
sends a signal indicating that the VSEND instruction is
being executed to the send control circuit 1307 via the line
L1310 so as to initialize the send control circuit 1307. As
a result, the first operand is output on the line L20,
together with the register write demand signal that the send
control circuit 1307 generates and supplies via the line
L1312 and the selector 1308, and is set in the field 50-1 of
the register 50 of the send unit 5. The second operand is
stored in the register 1310 via the line L1301 and the
selector 1315, and the number of elements of the vector data
is stored in the register 1312 via the line L1301 and the
selector 1316.

Subsequently, the following operations are repeated.

First, the content of the register 1310 in the address
generating circuit 1306 is output via the line L1315 and the
selector 1309 to the line L1320, and the write demand signal
generated by the send control circuit 1307 is output to the
line L1319. These are transferred together to the send unit
5 as the line L21, and the address is set in the field 50-2.
In conjunction with this operation, the element of the
vector register designated by the third operand is read onto
the line L1311 and transferred to the send unit 5 as the
line L20, together with the write signal generated by the
send control circuit 1307 and output via the line L1312 to
the selector 1308. The element is then set in the field
50-3. Subsequently, the send control circuit 1307 sends a
signal via the lines L1313 and L1306 to the increment
circuit 1311 and the decrement circuit 1313 within the
address generating circuit 1306, whereby the content of the
register 1310 is increased by 1 and the content of the
register 1312 is decreased by 1. At this time, the zero
detecting circuit 1314 sends a signal to the send control
circuit 1307 via the line L1307 if the value obtained by
decreasing the content of the register 1312 by 1 is "0".

In the send unit 5, every time the value is set 
into the register 50, the content of the register 50 is 
sent to the processor element designated by the 
content of the field 50-1 via the network. The 
operations after this value transmission are per- 
formed in the same manner as in the first preferred 
embodiment. 

Since the number of elements of the vector data has been set
in the register 1312 at the beginning of the instruction
execution, the above-described operations are repeated as
many times as the number of elements. Thereafter, the signal
is sent via the line L1307 to the send control circuit 1307.
Upon receipt of this signal, the send control circuit 1307
causes the content of the program counter to be updated via
the line L1308. With the operations described above, the
VSEND instruction is completed.
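
The repeated operation of the VSEND instruction can be modelled in software as follows. This is only a behavioural sketch under stated assumptions: the function name and the packet tuple layout are hypothetical, the two local variables stand in for the registers 1310 and 1312, and the element count is assumed to be at least 1 because the zero detection is described as taking place after the decrement.

```python
def vsend(destination: int, base_address: int, vector: list, count: int) -> list:
    """Return the packets (destination, address, element) emitted by VSEND."""
    if count < 1:
        raise ValueError("element count is assumed to be at least 1")
    address = base_address   # plays the role of register 1310
    remaining = count        # plays the role of register 1312
    index = 0
    packets = []
    while True:
        # One pass corresponds to filling fields 50-1, 50-2 and 50-3
        # of the send register and handing it to the network.
        packets.append((destination, address, vector[index]))
        address += 1         # increment circuit
        remaining -= 1       # decrement circuit
        index += 1
        if remaining == 0:   # zero detection ends the instruction
            return packets
```

Sending a three-element vector to processor element 2 with base address 100 yields one packet per element, addressed to 100, 101 and 102 in that order.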

Next, the VRECEIVE instruction will be described.

The VRECEIVE instruction reads valid data from the local
memory of the processor element in which the instruction is
executed, and stores the read data in a vector register.
Fig. 21 shows the format of the VRECEIVE instruction. The
VRECEIVE instruction has the following two operands.

1. base address

2. vector register number.

These operands are given, respectively, by a general-purpose
register designated by the R1 field of the instruction
format and by the VR2 field. This instruction means that the
valid data written into a continuous region starting from
the address designated by the first operand is read
sequentially and stored in the vector register designated by
the second operand. It should be noted that the number of
elements of the vector data has been stored in advance in a
specific register of the general-purpose register group 34
by another instruction.

When executing the VRECEIVE instruction, the parallel
computer according to the fifth preferred embodiment operates
as follows.

First of all, the instruction decoding controller 1301 of
the processor 13 outputs the value (register number) stored
in the field 36-2 of the instruction register to the
general-purpose register group 34 via the corresponding
line, and also outputs the value (vector register number)
stored in the instruction register to the vector register
group 1303 via the line L1303. It also sends a signal
indicating that the VRECEIVE instruction is being executed
to the memory access circuit 1302 via the line L1309 so as
to initialize the memory access circuit 1302. Meanwhile, the
first operand is sent via the line L1301 and the selector
1315 and stored in the register 1310. The number of elements
of the vector data is sent via the line L1301 and the
selector 1316 to the register 1312.

Subsequently, the following operations are repeated.

First, the content of the register 1310 in the address
generating circuit 1306 is output via the line L1315 and the
selector 1309, and the read demand signal generated by the
memory access circuit 1302 is output to the line L1305 and,
from the OR circuit 31, is transferred to the third port of
the local memory 6 as the line L8. As a consequence, the
local memory 6 outputs the value of the tag on the line L9
and the data on the line L10. The tag output on the line L9
is input to the memory access circuit 1302. When this value
is "0", the flip-flop 1399 (Fig. 20) in the memory access
circuit 1302 is not reset, and a read demand newly generated
by the circuit 1401 is again output via an AND gate 1400 to
the line L1305 so that the local memory 6 is read again.
When the value of the tag is "1", this value is output
directly on the lines L1321 and L1306, and the vector
register group writes the data into the vector register
designated by the second operand. Upon receipt of the tag
from the line L1306, the address generating circuit 1306
increases the content of the internal register 1310 by 1
with the increment circuit 1311 and decreases the content of
the internal register 1312 by 1 with the decrement circuit
1313. When the content of the register 1312 becomes "0" as a
result of the decrement, the zero detecting circuit 1314
notifies the memory access circuit 1302 via the line L1307.
Since the value output on the line L9 is 1, the output of
the AND circuit 300 within the invalidating circuit 33 also
becomes 1. The value "0" on the line L7 and the output of
the AND circuit 300 are transferred to the second port of
the local memory 6 as the tag write data and the write
demand signal, and the value output via the line L1315 and
the selector 1317 onto the line L6 is transferred as the
address, in conjunction with the operation of the memory
access circuit 1302. As a result, the tag of the word
addressed by the value output on the line L6 becomes "0". It
should be noted that this operation is controlled so as to
take effect prior to the increment of the content of the
above-described register 1312.

Since the number of elements of the vector data has been set
in the register 1312 at the beginning of the instruction
execution, the above-described operations are repeated as
many times as the number of elements, and thereafter the
signal is sent via the line L1307 to the memory access
circuit 1302. In the memory access circuit 1302, the
flip-flop 1399 is reset by the signal on the line L1307, the
memory read demand is no longer sent to the line L1305, and
the series of memory access operations is thus completed.
The signal on the line L1307 is also transferred via the
line L1308 to the program counter 35 so as to update the
content of the program counter 35. The above-described
operations constitute the operation of the VRECEIVE
instruction.
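
The tag-driven read loop of the VRECEIVE instruction can be sketched as below. This is a software model under explicit assumptions, not the circuit itself: `TaggedMemory`, `deliver`, the `on_wait` callback (which stands in for the receive circuit running in parallel with the processor) and the `max_polls` guard are all hypothetical names introduced for illustration.

```python
class TaggedMemory:
    """Local memory in which every word carries a 1-bit valid tag."""

    def __init__(self, size: int):
        self.tag = [0] * size
        self.data = [0] * size

    def deliver(self, address: int, value) -> None:
        """Model of the receive side: write the data and set the tag to 1."""
        self.data[address] = value
        self.tag[address] = 1


def vreceive(memory: TaggedMemory, base_address: int, count: int,
             on_wait=lambda: None, max_polls: int = 1000) -> list:
    """Read count valid words starting at base_address, clearing each tag."""
    if count < 1:
        raise ValueError("element count is assumed to be at least 1")
    vector_register = []
    address = base_address   # plays the role of register 1310
    remaining = count        # plays the role of register 1312
    polls = 0
    while True:
        if memory.tag[address] == 0:
            # Tag is 0: the word is not yet valid, so the read is repeated.
            polls += 1
            if polls > max_polls:
                raise TimeoutError("data never became valid")
            on_wait()
            continue
        vector_register.append(memory.data[address])
        memory.tag[address] = 0  # invalidate before the address advances
        address += 1
        remaining -= 1
        if remaining == 0:
            return vector_register
```

The loop advances to the next element only after a valid tag has been seen, which is how the reference order is assured even when the data arrives late.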

Next, the operation of the parallel computer according to
the fifth preferred embodiment is described. Each of the
processor elements constituting the parallel computer of the
fifth preferred embodiment operates in the same way as a
normal vector computer, except when data is transferred
between this processor element and another processor
element. When data transmission from one processor element
to another is needed, the processor element on the data
transmission side writes the data, by the above-described
VSEND instruction, into a region of the local memory of the
processor element on the data reception side that is used
for receiving the data (this region is determined in advance
by either the programmer or the compiler). The transfer is
completed by designing the program so that the processor
element on the data reception side reads the data by the
VRECEIVE instruction. With such an arrangement, the same
advantages as those of the first preferred embodiment are
achieved, and furthermore higher-speed data communication
can be established as compared with the first preferred
embodiment.

In the above-described parallel computer, the region of the
local memory 6 accessed by the VSEND instruction and the
VRECEIVE instruction is a continuous region. If the design
is such that the value added by the increment circuit within
the address generating circuit 1306 can be set by the
program, it is also possible to handle a data group arranged
at equal intervals in the local memory.
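
A programmable increment amounts to a strided address sequence, as in the following sketch. The function name and the `stride` parameter are illustrative assumptions; the text only states that the added value may be made programmable.

```python
def strided_addresses(base_address: int, stride: int, count: int) -> list:
    """Addresses visited when the increment circuit adds stride each pass."""
    return [base_address + j * stride for j in range(count)]
```

With a stride of 1 this reduces to the continuous region used by the VSEND and VRECEIVE instructions as described.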

To provide still more freedom, the following VSENDL
instruction and VRECEIVEL instruction may be employed.

Fig. 23A illustrates the format of the VSENDL instruction.
The VSENDL instruction has the following three operands.

1. destination

2. address vector

3. vector data

These operands are stored in the respective registers
designated by the R1, VR2 and VR3 fields of the instruction
format. This instruction means that the j-th element of the
vector of the third operand is written into the address
designated by the j-th element (j = 1, 2, ..., number of
elements of the vector data) of the vector of the second
operand, in the local memory of the processor element
designated by the first operand. It should be noted that the
number of elements of the vector data has been stored in
advance in a specific register of the general-purpose
register group by another instruction.

This instruction can be executed in substantially the same
manner as the above-described VSEND instruction, with the
following difference. With the VSEND instruction, the value
set in the field 50-2 of the register 50 in the send unit 5
is a value generated by the address generating circuit 1306,
and it is set in the field 50-2 via the line L1315, selector
1309, line L1320 and line L21. With the VSENDL instruction,
by contrast, the vector register within the vector register
group 1303 designated by the content of the field 36-3 of
the instruction register 36 is selected via the line L1303,
and the content of this vector register is set in the field
50-2 via the line L1314, selector 1309, line L1320 and line
L21.

Next, the format of the VRECEIVEL instruction is illustrated
in Fig. 23B. The VRECEIVEL instruction has the following two
operands:

1. address vector

2. vector register number.

These operands are given by the vector register designated
by the VR1 field of the instruction format and by the VR
field, respectively. This instruction means that valid data
is read from the address designated by the j-th element
(j = 1, 2, ..., number of elements of the vector data) of
the vector of the first operand, and written into the j-th
element of the vector register of the second operand. It
should be noted that the number of elements of the vector
data has been stored in advance in a specific register of
the general-purpose register group by another instruction.

Although this instruction can be executed in substantially
the same manner as the VRECEIVE instruction, there is the
following difference. With the VRECEIVE instruction, a value
generated by the address generating circuit 1306 is employed
as the read address for the local memory 6, and is
transferred to the local memory 6 via the line L1315,
selector 1309 and line L8. With the VRECEIVEL instruction,
by contrast, the vector register within the vector register
group 1303 designated by the content of the field 36-2 of
the instruction register 36 is selected, and the content of
this vector register is sent to the local memory 6 via the
line L1315, selector 1317 and line L6.

According to the parallel computer of the fifth preferred
embodiment, a sequence of addresses of the local memory 6 is
set in advance in a vector register within the vector
register group, and this vector register is designated as
the second operand of the VSENDL instruction or as the first
operand of the VRECEIVEL instruction. An arbitrarily
arranged data group in the local memory 6 can thereby be
sent and received between the relevant processor elements
while maintaining high efficiency and assuring the reference
order.
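
In software terms the VSENDL and VRECEIVEL instructions behave like a scatter and a gather driven by an address vector. The sketch below is a simplified model: all function names are hypothetical, the tagged memory is represented by two plain lists, and a word that is not yet valid raises an error where the hardware would simply repeat the read.

```python
def vsendl(destination: int, address_vector: list, data_vector: list,
           count: int) -> list:
    """Emit one packet per element, addressed by the address vector."""
    return [(destination, address_vector[j], data_vector[j])
            for j in range(count)]


def deliver(tags: list, data: list, packets: list) -> None:
    """Reception side: store each datum and mark its word as valid."""
    for _destination, address, value in packets:
        data[address] = value
        tags[address] = 1


def vreceivel(tags: list, data: list, address_vector: list, count: int) -> list:
    """Gather valid words in address-vector order, clearing each tag."""
    result = []
    for j in range(count):
        address = address_vector[j]
        if tags[address] != 1:
            # Simplification: the hardware would re-read until the tag is 1.
            raise RuntimeError("word not yet valid")
        result.append(data[address])
        tags[address] = 0
    return result
```

The same address vector is used on both sides, so the j-th element sent is the j-th element received even though the words are scattered in memory.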



PARALLEL COMPUTER ACCORDING TO A SIXTH 
PREFERRED EMBODIMENT 

Referring now to the figures, a parallel computer according
to a sixth preferred embodiment of the invention will be
described. Fig. 24 is a schematic block diagram of the
entire circuit arrangement of a parallel processor according
to the sixth preferred embodiment.

In Fig. 24, reference numeral 2001 denotes a data transfer
network between processor elements. Reference numerals
2002-1 to 2002-3 indicate processor elements. The internal
circuit arrangements of the respective processor elements
are identical. Reference numeral 2003 is a local memory in
the processor element. Reference numeral 2004 indicates a
receive unit, reference numeral 2005 denotes a send buffer,
reference numeral 2006 indicates an instruction process
unit, reference numeral 2007 denotes a memory controller,
reference numeral 2013 indicates an instruction controller,
reference numeral 2014 is a receive controller, reference
numeral 2015 is a general-purpose register, reference
numeral 2016 indicates a scalar calculator, reference
numeral 2017 represents a vector processing unit, reference
numeral 2030 represents an instruction register and
reference numeral 2031 denotes an instruction fetch circuit.
Further, reference numeral 2032 is a program counter PC.

The function of the local memory 2003 is to store a program
or data. The local memory includes a tag unit for storing a
1-bit tag for every word (1 word corresponds to 4 bytes in
the preferred embodiment). This tag unit is newly introduced
in order to realize the parallel computer according to the
present invention.

An instruction fetch circuit 2031 sequentially reads the
instructions at the addresses of the local memory 2003
indicated by the program counter 2032 into the instruction
register 2030, and each read instruction is decoded in the
instruction decoding unit 2013. When the read instruction
designates any register of the general-purpose register
group 2015, the designated register number is supplied to
that group; otherwise the calculating units 2016 and 2017
are controlled so that the calculation designated by the
read instruction is carried out. As illustrated in detail in
Fig. 27, the vector process unit 2017 is constructed of a
vector calculator 2017 and a vector register group 2070.

In Fig. 24, only three processor elements are illustrated,
but any number of processor elements may evidently be
employed. The function of the data transfer network between
the processor elements is to transfer a message to the
processor element having the PE number to which the message
is addressed. A crossbar switch, a multi-stage switch
network or a bus may be utilized as this data transfer
network 2001.

First, the data transfer process will be described. An
instruction for demanding message transfer is called a "send
instruction". The format of this instruction is as follows:
SEND GR1, GR2, GR3, GR4. SEND is the opcode, and GR1 to GR4
denote the numbers of the general-purpose registers holding,
respectively, the data to be sent, the main identifier MK
and the sub-identifier SK for the data to be sent, and the
destination processor element number. Moreover, the length
of the main identifier MK is held in advance in the
general-purpose register numbered "(GR2) + 1".

When this send instruction is read by the instruction fetch
circuit 2031 and set in the instruction register 2030, the
instruction decoding unit 2013 decodes the instruction, and
transfers the destination processor element number and the
transfer data, which are the contents of the corresponding
general-purpose registers, via the line L2020 to the send
buffer 2005. In addition, the main identifier MK, the
sub-identifier SK and the length "L" of the main identifier,
which are the contents of the corresponding general-purpose
registers, are sent to the address generating unit 2018.
This address generating unit 2018 generates the address of
the local memory in the processor element at the data
transmission side based upon these three pieces of input
information, and transfers this address via the line L2021.

This address generating unit 2018 according to a preferred
embodiment is illustrated in Fig. 25. The input length "L"
of the main identifier is converted by a subtracting circuit
2040 into the shift count of a left shifter 2041, which
shifts the main identifier MK to the left by (32 - L) bits
(in this embodiment it is assumed that the address space of
the local memory is 32 bits). The result is then ORed, in an
OR gate 2043, with the value obtained by shifting the
sub-identifier SK to the left by 2 bits in the shifter 2042,
and the ORed value is set into the send buffer 2005, via the
line L2021, as the address in the local memory of the
processor element at the data reception side.
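
The address generation just described reduces to two shifts and an OR, which the following sketch reproduces. The function name and the range checks are assumptions added for illustration; the 32-bit width, the (32 - L) shift of MK and the 2-bit shift of SK come from the text.

```python
ADDRESS_BITS = 32                  # address space assumed in the embodiment
MASK = (1 << ADDRESS_BITS) - 1


def generate_address(mk: int, sk: int, length: int) -> int:
    """Combine main identifier mk (length bits wide) and sub-identifier sk."""
    if not 1 <= length <= ADDRESS_BITS:
        raise ValueError("length of the main identifier must be 1..32")
    if mk < 0 or mk >> length:
        raise ValueError("main identifier does not fit in the given length")
    upper = mk << (ADDRESS_BITS - length)  # left shifter driven by 32 - L
    lower = sk << 2                        # sub-identifier shifted by 2 bits
    return (upper | lower) & MASK
```

The main identifier thus selects a region through the upper address bits, while the sub-identifier selects a word within it; the 2-bit shift is consistent with the 4-byte word stated for this embodiment. The sketch does not check that the shifted sub-identifier stays clear of the bits occupied by the main identifier.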

Referring back to Fig. 24, the message gen- 
erated by the send buffer 2005 is transferred to the 
data transfer network 2001, and then sent to the 
processor element having the destination PE 
(processor element) number within the message. 

The feature of the sixth preferred embodiment is as follows.
The identifier for the data to be sent consists of the main
identifier MK, which represents the data group to which the
data belongs, and the sub-identifier SK, which distinguishes
the data from other data in the data channel. Based upon
this main identifier MK and sub-identifier SK, the address
of the destination local memory within the processor element
at the data transmission side is generated. For instance,
when the maximum value is to be found in a large data group,
the data group is divided among the processors, each
processor detects the maximum value of the portion it
processes, and the maximum value obtained by each processor
is transferred to one processor, which retrieves the maximum
value of the entire data group.

At this time, the main identifier MK sent together with the
data indicates that the transferred data are those for
retrieving the maximum value, whereas the sub-identifier SK
indicates the number of the data within the above data
group.
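
The maximum-value example can be sketched as follows. Everything here other than the roles of MK and SK is an illustrative assumption: the function name, the default values of `mk` and `mk_length`, the use of the processor index as SK, and the dictionary standing in for the collecting processor's local memory.

```python
def collect_maximum(partitions: list, mk: int = 1, mk_length: int = 4) -> int:
    """Each partition's maximum is stored at an address built from MK and SK."""
    slots = {}
    for sk, part in enumerate(partitions):
        # Same address construction as in the sixth embodiment:
        # MK in the upper bits, SK shifted left by 2 bits.
        address = (mk << (32 - mk_length)) | (sk << 2)
        slots[address] = max(part)  # local maximum found by processor sk
    # The collecting processor scans the slots belonging to this MK.
    return max(slots.values())
```

Because each sender uses a distinct SK, the partial results land in distinct words and no sender overwrites another.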

As another example, when the data to be sent is one element
of certain vector data, the main identifier MK represents
the vector data to which the data to be sent belongs, and
the sub-identifier SK indicates its element number within
the vector data.

The format of the send instruction at this time 
is as follows: SEND, VR1, GR2, QR3 and GR4. 

It should be noted that VR1 is the number of the
vector register holding the vector data to be
transferred, GR2 indicates the identifier MK for the
data, GR3 is the element number to be transferred
within the vector data, and GR4 corresponds to the
PE number of the destination processor. Furthermore,
the length of the identifier MK is previously held in
the general-purpose register having the number
(GR2 + 1).

The data designated by VR1 and GR3 is read
from the vector register 2070 (Fig. 27) of the vector
process unit 2017, and sent via a line l 2022 to the
send buffer 2005. The transfer information other
than this data is also sent from the general-purpose
registers to the send buffer 2005. All of the vector
data in the vector register can be transferred by
repeatedly updating the content of the general-purpose
register designated by GR3, i.e., the element number.

In case that the destination PE number within
the message in the data transfer network 2001
corresponds to PE 2002-1, the address constructed
from the main identifier MK and sub-identifier SK
and the data within this message are held in the
receive buffer 2004-1 and receive buffer 2004-2,
respectively, and the write controller 2034 is
initialized. By this write controller, the content of
the register 2004 is sent to the memory controller
2007, and the data of the register 2004-2 is stored
at the address on the local memory indicated by the
register 2004-1. Furthermore, the write controller
2034 initializes a 1 generating circuit 2008, and sets
the tag corresponding to that address in the storage
to "1".
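
The effect of this write operation on the local memory can be
modelled roughly as below. The class and method names are
hypothetical, and the model ignores timing and arbitration; it only
shows that a received word and its valid tag are stored together.

```python
class TaggedMemory:
    """Word-addressed store in which every word carries a one-bit tag."""

    def __init__(self) -> None:
        self.words: dict[int, int] = {}
        self.tags: dict[int, int] = {}

    def write_received(self, address: int, value: int) -> None:
        # Store the data, then mark the word valid (tag = 1).
        self.words[address] = value
        self.tags[address] = 1

    def read(self, address: int) -> tuple[int, int]:
        # Words never written read back as data 0 with an invalid tag (0).
        return self.words.get(address, 0), self.tags.get(address, 0)
```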

At this time, the memory controller 2007
exclusively controls (arbitrates between) the write
demand to the local memory 2003 from the write
controller 2031 and the access demand from the
memory access controller 2033.

The function of this memory controller 2007 is
similar to that in a normal computer system, in
which the memory controller likewise exclusively
controls the read/write access from the CPU and the
read/write access from I/O. There are two different
points:

(1) Only a write demand is given from the
write controller 2034.

(2) The tag is set/reset, because a tag is
provided for each word.

As a consequence, one message has been transferred
from one processor element to the other processor
element.

The instruction for demanding the readout of
this received data is called "a receive instruction".
Also, the data read during the execution of this
instruction is called "received data" in the following
description. In the sixth preferred embodiment, there
are several receive instructions. A format of one
instruction is as follows: RECEIVE, GR1, GR2 and GR3.

It should be noted that RECEIVE is an ope-code of
this instruction, and GR1, GR2 and GR3 indicate a
number of a general-purpose register for storing the
received data, a number of a general-purpose register
for holding a main identifier used for retrieval
operations, and a number of a general-purpose
register for holding a length "L" of this main
identifier, respectively. The sub-identifier SK which
has been attached to the received data is stored into
the general-purpose register having a number of
(GR1 + 1).

When this receive instruction is read by the
instruction fetch circuit 2031, set into the instruction
register 2030, and furthermore decoded in the
instruction decoder 2013, both the main identifier MK
and the length "L" of the main identifier MK, held in
the two general-purpose registers GR2 and GR3
designated by the receive instruction, are sent via
the lines l 2026 and l 2027 to the receive controller
2014 so as to initialize the receive controller 2014.
Furthermore, the read demand for reading the received
data from the local memory 2003 is generated by
the memory access controller 2033.

Based upon these two pieces of information, the
receive controller 2014 successively produces the
addresses within the region at which the received
data group arrives, until received data is found, and
checks whether or not the tag corresponding to each
address on the local memory is valid.

In case that the received data has arrived (i.e.,
the corresponding tag is valid), the sub-identifier
corresponding to the data is generated, and is stored
via the line l 2029, together with the data sent via
the line l 2028, in the general-purpose registers
designated by the receive instruction. Thereafter,
the tag for the retrieved data on the local memory is
invalidated. Furthermore, a signal of "1",
representing that the desired data has been found,
is set in the condition code register 2019 in the
scalar calculator 2016.

When no received data is present in the region
of the local memory at which the received data
arrives (i.e., all of the tags corresponding to that
region are invalid), another signal of "0",
representing that the desired data could not be
found, is set in the condition code register 2019.
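
The behaviour of one receive instruction, as seen by the program, may
be summarized by the following sketch. It is a software model under
stated assumptions (32-bit addresses, 4-byte words, memory and tags
held in plain dictionaries); the function name and the return
convention are hypothetical and are not taken from the embodiment.

```python
def receive(data: dict, tags: dict, mk: int, length: int):
    """Model of RECEIVE: returns (cc, value, sk).

    cc = 1 when a word with a valid tag is found in the region of MK,
    cc = 0 when every tag in the region is invalid.
    """
    low_bits = 32 - length
    head = mk << low_bits                     # head address of the region
    for offset in range(0, 1 << low_bits, 4):  # step through the words
        address = head + offset
        if tags.get(address, 0) == 1:
            tags[address] = 0                 # invalidate the tag after reading
            return 1, data[address], offset >> 2
    return 0, None, None
```

For a short main identifier the region is large, so this loop is only
practical as an illustration with a long identifier (for example,
L = 28 gives a region of four words).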

Fig. 26 shows a circuit diagram of the receive
controller 2014 according to a preferred embodiment.
The main identifier length "L" sent via the line
l 2027 is converted by the subtractor 2050 into the
shift number of the shifter 2051. The main identifier
MK sent from the line l 2026 is shifted by (32-L)
bits in the left direction at the shifter 2051, and the
result is held in the register 2052. The content of
this register 2052 is sent to the local memory via
the line l 2066 as the head address of the region
where the received data group is stored on the local
memory. Furthermore, this content is used to
initialize the memory access controller 2033 (Fig. 24)
via the line l 2063. This memory access controller
2033 sends the read command to the local memory.
The content of the register 2052 is then incremented
by 4 by a +4 adder 2053 and supplied via the line
l 2060 to the local memory, and simultaneously the
read demand is issued from the memory access
controller 2033. This is repeated until either the
received data is found or all of the region in which
the received data group is stored on the local
memory has been searched.

When the received data is found, the
corresponding tag information "1" is sent from the
local memory 2003 (Fig. 24) via the line l 2061 to
the register 2058. When the tag information "1" is
set in the register 2058, the invalidating circuit
2059 is initialized, and a tag invalidating signal is
transferred, by which the tag corresponding to the
address from which the received data is read on the
local memory 2003 (Fig. 24) is invalidated, i.e., set
to "0". Furthermore, completion of the read demand
is reported via the OR gate 2054 and line l 2063 to
the memory access controller 2033 (Fig. 24), and
the condition code register cc 2019 (Fig. 24) is set
via the line l 2039. Then, a signal for invalidating
the tag, held in the register 2055, is sent via the
line l 2033 to the local memory. In addition, the
received data is sent to the general-purpose
register; the lower (32-L) bits of the address
retrieved at this time are cut out, shifted by 2 bits
in the right direction in the shifter 2057, and
thereafter transferred as the sub-identifier SK via
the line l 2029 to the designated general-purpose
register.

When all of the region where the received
data group is stored has been searched (namely,
no received data has been found), this is judged
from the carry out of the lower (32-L) bits in the
+4 adder circuit, completion of the read demand is
announced to the memory access controller 2033
(Fig. 24) via the OR gate 2054 and line l 2031, and
thereafter "0" is set in the condition code register
cc 2019. That is, no received data has been found.

Thus, one receive instruction has been ex- 
ecuted. 
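
The address arithmetic performed by the receive controller can be
written out as bit operations. The following is a sketch only: it
assumes 32-bit addresses, and the three helper names are hypothetical.
It shows the head address obtained by the left shift, the end-of-region
judgement by the carry out of the lower (32-L) bits, and the
sub-identifier obtained by cutting out the lower bits and shifting
right by 2.

```python
MASK32 = 0xFFFFFFFF


def head_address(mk: int, length: int) -> int:
    """MK shifted left by (32-L) bits gives the head of the region."""
    return (mk << (32 - length)) & MASK32


def next_address(address: int, length: int):
    """Add 4 to the lower (32-L) bits; carry = 1 means the region is exhausted."""
    low_bits = 32 - length
    low_mask = (1 << low_bits) - 1
    total = (address & low_mask) + 4
    carry = total >> low_bits
    upper = address & ~low_mask & MASK32
    return upper | (total & low_mask), carry


def sub_identifier(address: int, length: int) -> int:
    """Cut out the lower (32-L) bits and shift right by 2 bits."""
    return (address & ((1 << (32 - length)) - 1)) >> 2
```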

After completion of this receive instruction, the
instruction fetch circuit 2031 reads out the well
known "branch on condition instruction" from the
memory 2003 in order to judge whether or not the
data required for the subsequent instructions has
been successfully received. Then, the branch on
condition instruction is performed, and control
branches back to the above-described receive
instruction if the content of the condition code
register 2019 is equal to "0". If the content of the
condition code register 2019 is equal to "1", the
instruction series succeeding this branch on
condition instruction is read from the memory 2003
and then executed. This instruction series contains
instructions for performing calculation on the
received data. For instance, this calculation
corresponds to retrieving the data having the
maximum value from the data within the same group,
i.e., the data having the same main identifier. The
reason why both the received data and sub-identifier
are received during the execution of the receive
instruction is that the sub-identifier SK is employed
as the number for identifying the data having the
maximum value.

The scalar instruction series for retrieving the
maximum value will now be summarized. It is assumed
that one general-purpose register in the
general-purpose register group 2015 (its number is
GR4) is assigned to store the maximum value, another
general-purpose register (its number is GR5) is
assigned to store the sub-identifier of the data
having the maximum value, and the initial values of
these general-purpose registers are set to "0".
Following the receive instruction and the subsequent
branch on condition instruction, the received data
and the data in the general-purpose register having
the number GR4 are compared with each other in the
scalar calculator 2016. Then, an instruction series
is executed such that the data having the larger
value is stored into the general-purpose register
having the number GR4, and either the sub-identifier
SK for the received data, held in the general-purpose
register having the number (GR1 + 1), or the
sub-identifier SK in the general-purpose register
having the number GR5 is selected in response to the
above-described comparison result and thereafter
stored into the general-purpose register having the
number GR5.

After this instruction series is performed, the
total number of the received data is counted and a
"branch on count" instruction is executed to perform
the branch, depending upon whether or not this total
number has reached a predetermined element number.
In other words, the required element number of the
received data is previously stored in the
general-purpose register having the number GR6, this
element number is counted down by 1 during the
instruction execution, and if this value is not equal
to 0, control jumps to the instruction at the address
previously stored in the general-purpose register
designated by this instruction. Since this address is
set to the address of the above-described receive
instruction, the receive instruction is again
executed in case that the number of received elements
has not yet reached the predetermined element number
of the desired received data.
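
The overall control flow of the maximum value retrieval (receive,
branch on condition, compare, branch on count) may be illustrated by
the following sketch. The register names are reused as local variable
names purely for readability, and `receive_once` is a hypothetical
stand-in for one execution of the receive instruction that returns
(cc, value, sk). As in the description, the running maximum starts
at 0.

```python
def find_maximum(receive_once, element_count: int):
    """Return (maximum value, its sub-identifier) over `element_count` data."""
    gr4 = 0              # running maximum, initial value 0
    gr5 = 0              # sub-identifier of the running maximum
    gr6 = element_count  # number of data still to be received
    while gr6 != 0:
        cc, value, sk = receive_once()
        if cc == 0:
            continue     # branch on condition: retry the receive instruction
        if value > gr4:
            gr4, gr5 = value, sk
        gr6 -= 1         # branch on count: count down and loop while non-zero
    return gr4, gr5
```

Note that the loop retries indefinitely while no data has arrived,
which corresponds to the program repeatedly re-executing the receive
instruction.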

As previously described, according to the
above-described receive instruction, a plurality of
data designated by the main identifier MK can be
read out from the local memory 2003 irrespective of
the value of the sub-identifier SK, and the
calculation for the data read by the receive
instruction is carried out while the subsequent data
are transferred from the data transfer network into
the local memory 2003. Thus, according to the
present preferred embodiment, the data can be read
from the local memory and processed irrespective of
the difference in the sub-identifier SK.

Other receive instructions employed in the
parallel computer according to the present preferred
embodiment include the format RECEIVE, VR1, GR2
and GR3. It should be noted that GR2 and GR3 of
this receive instruction are the same as those of the
first receive instruction, i.e., they designate the
main identifier MK and the main identifier length "L"
representing the length thereof. VR1 denotes the
number of the vector register for storing the data
received in response to this receive instruction.
That is, this instruction implies that the data
coincident with the main identifier MK is read from
the local memory 2003 by way of the receive
controller 2014, and the read data is stored into
the vector register having the number VR1. At this
time, the sub-identifier SK which has been attached
to the data read out from the local memory 2003 is
generated by the receive controller 2014 in a manner
similar to the previous case, and is used for
designating the data storage position within the
vector register.

An operation of the apparatus during the
execution of this instruction will now be described.

When this instruction is stored into the
instruction register 2030, the instruction controller
2013 sends the vector register number VR1 designated
by this instruction via the line l 2035 to the vector
process unit 2017, and both the main identifier MK
and main identifier length "L" are sent from the
general-purpose register group 2015 to the receive
controller 2014, as in the first receive instruction,
in order to search the local memory. When the data
corresponding to the identifier MK is read out, this
data and the sub-identifier SK attached thereto are
sent via the respective lines l 2025 and l 2027 to
the vector process unit 2017. Referring now to
Fig. 27, this vector process unit 2017 is constructed
of the vector register group 2070, vector calculator
2071, local memory 2003 (Fig. 24), selector 2077 for
selecting the vector register into which the vector
data supplied from the receive controller 2014 should
be written, selector 2008 for selecting the vector
register from which the vector data should be
supplied to the vector calculator 2071, and a write
circuit 2071 W and a readout circuit 2071 R provided
for each of the vector registers. In Fig. 27, there
are shown only the write circuit 2071 W and readout
circuit 2071 R for the vector register 2070-1. This
write circuit 2071 W is composed of a WA register
2072 for holding the write address, a +1 up-counter
circuit 2074, and a selector 2006 for selecting
between an input from the line l 2043 and an output
from the +1 up-counter circuit 2074 so as to supply
the selected value to the WA register 2072. The
readout circuit 2071 R is constructed of an RA
register 2073 for holding the readout address and a
+1 up-counter circuit 2075 for counting up this
value by +1.

When the above-described receive instruction
is performed, the vector register number VR1
designated by this instruction, supplied from the
instruction executing controller 2013 (Fig. 24) via
the line l 2080, is input into the selector 2077, and
the data read out by the receive controller 2014
(Fig. 24) onto the line l 2028 is sent to the vector
register having the number VR1. It is assumed that
the vector register 2070-1 corresponds to the vector
register having the number VR1 designated by the
above-described receive instruction. At this time,
the write circuit 2071 W related to this vector
register 2070-1 is initialized by the instruction
decoding unit 2013, and simultaneously the selector
2078 selects the input supplied from the line l 2043.
As a result, the sub-identifier SK output from the
receive controller 2014 to the line l 2043 is set
into the WA register 2072, so that the data supplied
from the line l 2044 is written into the storage
location of the vector register 2070-1 corresponding
to the sub-identifier SK. As apparent from the
foregoing descriptions, the number allocated to the
vector data is employed as the main identifier MK,
and the number allocated to the respective elements
in this vector data is employed as the sub-identifier
SK, so that each received data (vector element) can
be written at its own position in one of the vector
registers.
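
Because the sub-identifier SK is used as the write address within the
vector register, vector elements may arrive in any order and still be
placed correctly. A minimal sketch of this idea follows; the function
name and the use of a Python list as the vector register are
assumptions made for illustration.

```python
def store_received_elements(arrivals, vector_length: int) -> list:
    """Place each (sk, value) pair at position sk of a vector register model."""
    vector_register = [None] * vector_length
    for sk, value in arrivals:
        # SK plays the role of the write address set into the WA register.
        vector_register[sk] = value
    return vector_register
```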

It should be noted that during the execution of
this receive instruction, whether or not there is
data coinciding with the main identifier MK is
reflected in the condition code register 2021
(Fig. 24), which is similar to the operation of the
first receive instruction described above. Similarly,
the operations in which the branch on condition
instruction is executed so as to branch back to the
above-described receive instruction, and the
above-described receive instruction is again
performed after a failure in the data reception, are
the same as those in the execution of the first
receive instruction.

After the above-described receive instruction is
executed, to judge whether or not the required
number of vector elements has been received, the
same branch on count instruction is employed, and
the above-described receive instruction is carried
out as many times as necessary.

As a consequence, the desired number of
vector elements can be stored into one vector
register. Thereafter, by executing a vector
calculation instruction, an instruction by which
vector data is stored into the memory 2003 (Fig. 24),
or an instruction by which vector data is loaded
from the memory 2003, the data process for the
received vector data can be performed.

It is also possible that a receive instruction
other than the above-described two types of receive
instructions may be provided. For instance, the data
may be stored into a register other than a
general-purpose register or a vector register (e.g.,
a floating-point register, not shown).

As is apparent from the above-described
descriptions, the identifier (main identifier MK)
attached to the respective data is employed in the
sixth preferred embodiment, whereby a plurality of
data belonging to this identifier can be fetched
from the local memory. Accordingly, the idea
according to the sixth preferred embodiment may also
be applied to a case in which no sub-identifier is
attached to the data. In addition, if the main
identifier length "L" is constant, there is no need
for an identifier length L for a retrieval purpose
to be supplied from the process apparatus 2005 to
the receive controller 2014. However, as in the
present preferred embodiment, when the identifier
length is designated, the same receive controller
2014 may be utilized for various main identifier
lengths.



PARALLEL COMPUTER ACCORDING TO A
SEVENTH PREFERRED EMBODIMENT

Referring now to Fig. 28, a parallel computer
according to a seventh preferred embodiment of
the invention will be described. It should be noted
that the same reference numerals shown in the
sixth preferred embodiment (Fig. 24) will be
employed as those for denoting the same circuit
elements shown in Fig. 28.

A different point between the sixth and seventh
preferred embodiments is that an address
generating unit 2018 is employed in the processor
element at the data reception side, not in the
processor element at the data transmission side.

In accordance with the parallel computer of the
seventh preferred embodiment, a message 2063 is
composed of a destination processor element
number, a length "L" of a main identifier, a main
identifier MK, a sub-identifier SK and data in the
data transmission process.
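
A message of this form, together with address generation on the
reception side, might be modelled as follows. The field names, the
32-bit address and the 4-byte word size are assumptions for
illustration only; the embodiment does not specify field widths here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    dest_pe: int  # destination processor element number
    length: int   # main identifier length "L"
    mk: int       # main identifier
    sk: int       # sub-identifier
    data: int     # data word being transferred


def receiver_write_address(msg: Message) -> int:
    """Write address formed at the reception side from L, MK and SK."""
    return (msg.mk << (32 - msg.length)) | (msg.sk << 2)
```

In this arrangement the sender transmits the identifiers as they are,
and only the receiver needs to know how identifiers map onto its
local memory.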

This message is stored via a data transfer
network 2001 into a receive buffer 2064 in a
receive processor element. The main identifier
length "L", main identifier MK and sub-identifier SK
of the message stored in the receive buffer 2064
are transferred to the address generating unit 2018.
In this address generating unit 2018, the write
address of the receive data on the local memory is
generated.

PARALLEL COMPUTER ACCORDING TO AN
EIGHTH PREFERRED EMBODIMENT

A parallel computer according to an eighth 
preferred embodiment of the invention will now be 
described with reference to Fig. 29. 

It should be noted that the same reference 
numerals shown in the sixth preferred embodiment 
(Fig. 24) will be employed as those for indicating 
the same circuit elements shown in Fig. 29. 

A different point between the sixth and eighth
preferred embodiments is the use of a receive
memory. In the parallel computer according to the
eighth preferred embodiment, a dedicated receive
memory having tags is employed. In a local
memory 2009, there is no tag.

With respect to the receive memory, both the
write process from a receive buffer 2004 and the
readout process from a receive controller 2014
are performed.

An address 2004-1 of the message stored
in the receive buffer 2004 indicates an address of
a receive memory 2010, and receive data 2004-2 is
written at the address 2004-1 of the receive
memory 2010 in response to a write demand supplied
by a write controller 2031.

In the data reception process, on the other
hand, by the initialized receive controller 2014 and
read controller 2035, the receive data is read from
the receive memory 2010 in the same manner as
the receive data is read out from the local memory
in the sixth preferred embodiment.

There are particular advantages according to
the eighth preferred embodiment as follows.

(1) The tag is not required to be attached to
all memory regions of the local memory, as is the
case in the sixth preferred embodiment.

(2) Since the receive memory 2010 is separated
from the local memory 2009, there is no competition
between the write demand derived from the receive
buffer 2004 for the receive memory 2010, and the
access demand for the local memory by the normal
instruction derived from the instruction process
unit 2006.

(3) When a program is written, the address of
the local memory is used for processing the data
within the processor element. For the data
communication between the processor elements, the
receive address generated by utilizing the main
identifier MK, sub-identifier SK and main identifier
length "L" is employed, so that the addresses used
for the calculation process within a processor
element can be separated from the addresses used
for the data communication between the processor
elements. As a consequence, the program can be
written easily.

In the eighth preferred embodiment, the address
space of the local memory 2009 was completely
separated from the address space of the receive
memory. However, in view of the hardware
realization, the address space of the receive memory
may be realized as a part of the address space of
the local memory 2009. For instance, a predetermined
region of the local memory starting from a
previously designated address is used as a receive
memory area, and a tag is provided for each word
within this receive memory region.

These arrangements can be realized under the
following condition. That is, when the send address
is generated, and when the receive address is
produced in the receive controller 2014, the head
address of the receive memory is added to the send
address and the receive address.
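
The addition of the head address can be sketched as a simple
relocation step. The function name, the explicit size check and the
parameter names are hypothetical; the description only states that
the head address of the receive memory is added.

```python
def relocate(offset: int, receive_head: int, receive_size: int) -> int:
    """Map an identifier-derived offset into the receive area of local memory."""
    if not 0 <= offset < receive_size:
        raise ValueError("offset falls outside the receive memory area")
    # The head address of the receive memory is added to the offset.
    return receive_head + offset
```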

Fig. 30 illustrates an address generating unit
2018 according to a preferred embodiment. In this
figure, the same reference numerals shown in Fig.
25 are employed as those for denoting the same
circuit elements. In the address generating unit
2018, a register 2044 indicating a head address of
the receive memory and an adder 2045 are newly
employed.

Fig. 31 illustrates a receive controller 2014
according to a preferred embodiment. In this figure,
the same reference numerals shown in Fig. 26 will
be employed as those for denoting the same circuit
elements. In the receive controller 2014, a register
2068 representing a head address of the receive
memory and an adder 2069 are newly employed.

Since the address generating unit and receive
controller are modified in this way, the program
does not need to be aware of where the receive
memory is realized on the local memory 2009.



PARALLEL COMPUTER ACCORDING TO A
NINTH PREFERRED EMBODIMENT

Referring now to Fig. 32, a parallel computer
according to a ninth preferred embodiment of the
invention will be described. In the ninth preferred
embodiment, as an identifier attached to data to be
sent, a main identifier indicative of the data group
to which this data belongs, and also a sub-identifier
for discriminating this data from other data in this
data group are employed, and the data having the
same main identifier is retrieved so as to be read
from the local memory, which is similar to the sixth
preferred embodiment.

However, in the previous sixth preferred
embodiment, one piece of data was transferred to the
other processor element in response to one send
instruction, and one piece of data was read from
the associative memory device in response to one
receive instruction. In the ninth preferred
embodiment, to the contrary, one group of data is
sent to the other processor element in response to
one send instruction, and a plurality of data is read
out from the local memory in response to one receive
instruction. A detailed operation relating to the
above-described features will now be described.

It should be noted that the same reference
numerals shown in Fig. 24 will be employed as
those for indicating the same, or similar circuit
elements shown in Fig. 32. As circuit arrangements
different from those shown in Fig. 24, reference
numeral 2090 is a receive unit, reference numeral
2091 denotes a send unit, reference numeral 2093
indicates a receive controller, reference numerals
2094 and 2096 represent counter circuits, reference
numerals 2095 and 2097 denote control circuits, and
reference numeral 2092 denotes a memory controller.

The receive unit 2090 and send unit 2091 are
operated independently of the instruction process
unit 2006 under the control of the respective
control circuits 2095 and 2097. As these control
circuits 2095 and 2097, a microprocessor may be, for
instance, employed. The memory controller 2092 is
realized by modifying the memory controller 2007
shown in Fig. 24. The memory controller 2092 handles
access from three sources, i.e., the write operation
by the receive buffer 2004, the read operation by
the receive unit 2090, and the write/read operations
by the instruction process unit 2006. The receive
controller 2093 is obtained by modifying the receive
controller 2014 shown in Fig. 24.

First, the data send process will be described.
The data send instruction employed in the parallel
computer according to the ninth preferred
embodiment is as follows:
SEND, VR1, GR2, GR3, and GR4.
Here, SEND is an ope-code, VR1 indicates the
number of the vector register to be sent, and GR2 to
GR4 indicate the numbers of the general-purpose
registers for storing the main identifier MK with
respect to the data to be transferred, the vector
length VL and the destination processor element
number thereof, respectively. In addition, the main
identifier length L is previously held in the
general-purpose register having the number of
(GR2) + 1.

When this send instruction is set in the
instruction register, the instruction decoding unit
2013 decodes the instruction; the destination
processor element number as the content of the
above-described register is sent to the send buffer
2005; similarly, the main identifier MK and main
identifier length L as the contents of the
general-purpose registers are sent to the address
generating unit; the vector length VL is sent via
the line l 2101 to the counter circuit 2096; the
number of the vector register for holding the vector
data to be sent is sent to the vector process unit
2017; and furthermore the instruction controller
2013 initializes the control circuit 2097 and vector
process unit 2017. Thereafter, this instruction
controller 2013 commences the decoding operation for
the subsequent instruction.

The control circuit 2097 resets the counter
circuit 2096 via the line l 2102. From the vector
process unit, the vector data are successively
output one element at a time, and then set via the
line l 2022 into the send buffer. Furthermore, from
the vector process unit, the element number of the
data is sent via the line l 2100, as the
sub-identifier SK, to the address generating unit
2018. In the address generating unit 2018, the
address on the local memory in the destination
processor element is generated from the three pieces
of input information, and sent to the send buffer
2005. From the send buffer, a message is generated
for every element of the vector data, and then
transferred to the data transfer network.

Every time the data is sent, the counter circuit
performs its counting operation, and announces the
end of the data transmission via the line l 2103 to
the control circuit 2097 when the counting operation
has been repeated VL times. As illustrated in, for
example, Fig. 33, the counter circuit 2096 is
composed of two registers, a (+1) adder, and a
comparator circuit.
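
The send operation for one vector, in which the element number serves
as the sub-identifier SK and a counter terminates the transfer after
VL elements, can be sketched as follows. This is a sequential software
model under the assumption of 32-bit addresses and 4-byte words; the
function name and the tuple layout of a message are hypothetical.

```python
def send_vector(vector, mk: int, length: int, dest_pe: int) -> list:
    """Return one (dest_pe, address, data) message per vector element."""
    messages = []
    vl = len(vector)  # vector length VL
    count = 0         # counter circuit, reset at the start
    while count != vl:  # comparator: stop after VL elements
        sk = count      # element number is used as SK
        address = (mk << (32 - length)) | (sk << 2)
        messages.append((dest_pe, address, vector[sk]))
        count += 1      # (+1) adder
    return messages
```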

There are two methods for announcing the end
of the send instruction execution to the instruction
controller 2013. As the first method, the control
circuit 2097 interrupts the instruction controller
2013. As the second method, an instruction for
regularly checking the condition of the data
transmission device is issued in the instruction
processing device 2006. As such an instruction, the
following TEST SEND instruction may be considered:
TSEND.

TSEND corresponds to an ope-code of the TEST
SEND instruction. If the data transmission process
is not yet accomplished, "1" is set into the
condition code register cc 2019, whereas if the data
transmission process is completed, "0" is set into
this register.

The instruction controller 2013 checks the
content of this condition code register cc 2019 so
as to grasp the condition of the data send device 2019.
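
The second method amounts to a polling loop around the TEST SEND
instruction. A minimal sketch is given below; `tsend` is a
hypothetical callable standing in for one execution of TSEND and
returning the resulting condition code (1 while transmission is in
progress, 0 once it is completed).

```python
def wait_for_send(tsend) -> int:
    """Poll TSEND until the condition code becomes 0; return the number of busy polls."""
    busy_polls = 0
    while tsend() == 1:  # cc = 1: transmission not yet accomplished
        busy_polls += 1
    return busy_polls
```

In a real program other useful instructions would normally be
executed between polls rather than spinning as this sketch does.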

The features according to the ninth preferred
embodiment are such that the sub-identifier SK is
successively produced together with the data, and
moreover the address of the local memory in the
destination processor element is produced by
utilizing the main identifier MK designated by the
instruction, with the result that one data group is
transferred in response to a single instruction.

The means employed in the ninth preferred
embodiment is similar to that of the first preferred
embodiment, where the message data present in
the data transfer network 2012 is stored in the local
memory 2003 in the destination processor element.

Next, the data reception process will be
described. The receive instruction employed in the
ninth preferred embodiment is as follows:
RECEIVE, VR1, GR2 and GR3.

It should be noted that RECEIVE is an ope-code
of this instruction, VR1 is the number of the
vector register into which the data is to be
received, and GR2 and GR3 are the numbers of the
general-purpose registers for storing the main
identifier MK with respect to the data to be
received, and the vector length VL thereof. In
addition, the main identifier length L is previously
held in the general-purpose register having the
number of (GR2) + 1.

When this receive instruction is set in the
instruction register 2030, the instruction controller
2013 sends the main identifier MK and the main
identifier length L, as the contents of the
above-described registers, to the receive controller 2093;
sends the vector length VL to the counter circuit
2094; and sends the number of the vector register for
storing the received vector data. Furthermore,
the instruction controller 2013 initializes both the
control circuit 2095 and the vector process unit 2017.
Thereafter, the instruction controller 2013 commences
decoding the next instruction.

The control circuit 2095 resets the counter
circuit 2094. Subsequently, the receive controller
2093 is initialized. Although the receive controller
2093 is similar to the receive controller (Fig. 26)
according to the sixth preferred embodiment, it
differs in that, in the ninth preferred embodiment,
all of the data groups required for processing the
first receive instruction are received. As
a result, the (+4) adder 2053 (Fig. 26) does not
perform the count-up operation until the data for
the address held in the register 2052 (Fig. 26) is
received. In other words, when the data for the
address stored in the register 2052 (Fig. 26) is
received, the (+4) adder 2053 (Fig. 26) is activated,
the address in the register 2052 (Fig. 26) is
updated, and the subsequent data is received. At
the same time, the counter circuit 2094 is activated
so that the counter counts up by +1. The
construction of the counter circuit 2094 is the same as
that of the counter circuit 2096.

The data read by the receive controller 2093 is
transferred together with the sub-identifier SK to
the vector processing unit 2017, and then stored in
the vector register designated by the receive
instruction. The sub-identifier SK is used as the
element number at this time.

Every time the counter circuit 2094 receives
data, the counting operation is carried out, and
the end of the data reception is announced to the
control circuit 2095 when the counting operation has been
repeated VL times, corresponding to the vector
length.
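
The reception sequence described above (reading words in address order, waiting for each one to arrive, using SK as the element number, and counting up to VL) can be sketched in software as follows. This is a simplified model under stated assumptions: the function name `receive_vector`, the list-based memory and the per-word `valid` flags are illustrative, and word indices are used in place of byte addresses stepped by the (+4) adder.

```python
def receive_vector(arrivals, vl):
    """Model reception of one data group of vl elements.

    arrivals is a list of (sk, value) messages in network arrival order.
    Returns (register, done); done mirrors the end-of-reception signal.
    """
    memory = [None] * vl     # reception area for one data group
    valid = [False] * vl     # one tag per word
    register = [None] * vl   # destination vector register
    current = 0              # models the address register; advances in order
    count = 0                # models the counter circuit, reset at start
    for sk, value in arrivals:
        memory[sk] = value   # the message is written on arrival
        valid[sk] = True
        # Advance only while the word at the current address has arrived.
        while current < vl and valid[current]:
            register[current] = memory[current]  # SK is the element number
            current += 1
            count += 1       # count up by +1 per word read
    return register, count == vl


reg, done = receive_vector([(2, 30.0), (0, 10.0), (1, 20.0)], vl=3)
print(reg, done)
```

Here the element with SK = 2 arrives first but is only moved into the register after elements 0 and 1 have been read, which matches the in-order advance of the address register.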

There are two methods for announcing the end
of the receive instruction to the instruction
controller 2013, which are similar
to those of the data transfer device. In
the first method, the control circuit 2095 interrupts
the instruction controller 2013. In the second
method, an instruction for regularly checking the
condition of the data receiving device is issued in
the instruction processing device 2006. As such an
instruction, the following TEST RECEIVE instruction
is, for instance, conceivable:
TRCV.

TRCV is the op-code of the TEST RECEIVE
instruction. If the data reception process is not yet
completed, "1" is set in the condition code
register cc 2019, whereas if the data reception
process is completed, "0" is set therein. The
instruction controller 2013 can recognize the
condition of the data receive device by checking the
contents of this condition code register cc 2019.

According to the present invention, in a parallel
computer constructed of a plurality of processor
elements including local memories, where
data is written from another processor element
into the local memory, the overhead occurring
in the data communication between the processors
can be considerably reduced, so that the parallel
computer can achieve higher performance.

Also, according to the present invention, the
data required for a process in which the
commutative law holds can be fetched by the
receiving processor in the order in which the data
have arrived at the local memory, so that the time
period during which the data receiving processor is
idle can be shortened.

In addition, since the boundary for dividing the
identifier into two parts can be freely determined
in the parallel computer of the invention,
an identifier having a limited length can be
efficiently utilized.

Furthermore, confirmation of the data reception
from a plurality of processor elements can be
effected at one time by only one instruction. As
a consequence, even if a plurality of data are sent to
and received from a plurality of processor elements,
only several instructions are required.
Accordingly, the present invention can prevent the
efficiency of the parallel process from being lowered by
the execution of a large number of instructions for
data communication.

Claims

1. A parallel computer (Figs. 1, 6, 8, 13) comprising:
(a) a plurality of processor elements (1-1 to
1-n) connected to each other by a network (2);

(b) each of said processor elements includ- 
ing a local memory (6) for holding a program and 
data related thereto, a processor (3) for performing 
an instruction in said program, means (5) for trans- 
ferring the data to the other processor elements, 
and means (4) for receiving the data sent from the 
other processor element; 

(c) a memory area (92, 8) constructed of a
plurality of reception data areas for temporarily
storing data received by said receiving means, and
a memory area (92, 8) constructed of a plurality of
tag areas, provided for each of the reception data
areas, for storing a valid data tag or an invalid data
tag indicating that the data in the corresponding
reception data area is valid or invalid;

(d) a transmitting means (5) for transmitting
the data to be transmitted with a data identifier,
predetermined for said data, attached thereto;

(e) a receiving means for writing the data 
into one of said plurality of reception data areas in 
response to the data received from said network, 
and writing the valid data tag into one of said 
plurality of reception data areas, said receiving 
means being parallel-operated with said processor, 
and 

(f) an access means (38) for reading both 
the data and tag from one of the reception data 
areas determined by said data identifier and from 
the corresponding tag areas in response to the 
data identifier designated by the instruction which 
is produced from said program for requiring the 
data reception, and for repeatedly reading the tag 
and data from the tag area and reception data area 
until the valid data tag is read out from the tag area 
in case that the read tag corresponds to the invalid 
data tag. 
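
The interplay of elements (c), (e) and (f) can be pictured with a small software analogy. It is a sketch under stated assumptions, not the claimed hardware: the class `ReceptionAreas`, the use of the data identifier directly as an index, the lock, and the timer standing in for the network are all illustrative choices.

```python
import threading
import time


class ReceptionAreas:
    """Reception data areas, each paired with a valid/invalid tag."""

    def __init__(self, n):
        self.data = [None] * n
        self.valid = [False] * n   # tag per area: False means invalid
        self.lock = threading.Lock()

    def receive(self, identifier, value):
        # Receiving side: write the data and set the valid data tag.
        with self.lock:
            self.data[identifier] = value
            self.valid[identifier] = True

    def read(self, identifier, poll_interval=0.001):
        # Access side: re-read tag and data until the valid tag is read.
        while True:
            with self.lock:
                tag, value = self.valid[identifier], self.data[identifier]
            if tag:
                return value
            time.sleep(poll_interval)


areas = ReceptionAreas(4)
# The timer stands in for a message arriving from the network later on.
sender = threading.Timer(0.05, areas.receive, args=(2, "payload"))
sender.start()
result = areas.read(2)   # blocks until area 2 carries a valid tag
sender.join()
print(result)
```

The reader never needs a separate synchronization message: the tag stored next to the data is what tells it whether the data has arrived.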

2. A parallel computer (Figs. 1, 18, 13) as
claimed in claim 1, wherein said plurality of reception
data areas correspond to a portion of memory
areas of said local memory.

3. A parallel computer (Fig. 6) as claimed in
claim 1, wherein said plurality of reception data
areas are provided within a receive memory (8)
which is accessed independently of said
local memory.

4. A parallel computer (Figs. 1, 6, 13) as
claimed in claim 1, wherein a data identifier sent
from said transmitting means is equal to an address
of one reception data area within a destination
processor element for storing the data sent
from said transmitting means.



5. A parallel computer (Fig. 1) as claimed in
claim 4, wherein said data identifier designated by
said data reception demand instruction is equal to
an address of one of said reception data areas
accessed by said access means.

6. A parallel computer (Fig. 1) as claimed in
claim 5, wherein said plurality of reception data
areas provided in each of said processor elements
correspond to a portion of the areas of said local
memory.

7. A parallel computer (Fig. 6) as claimed in
claim 5, wherein said plurality of reception areas
provided in each of said processor elements are
located in the receive memory (8), which is
addressed separately from said local memory.

8. A parallel computer (Fig. 13) as claimed in
claim 4, wherein said processor includes a means
(34) for supplying both data to be sent and a data
identifier to said transmitting means in response to
an instruction for demanding the data transmission,
said data identifier being different from the address
of one of said reception data areas within said
destination processor element, which has been
previously determined, and said transmitting means
includes a transforming means (120) for producing,
as a data identifier to be sent, an address of one of
said reception data areas within said destination
processor element from said supplied data identifier.

9. A parallel computer as claimed in claim 8,
wherein said data identifier supplied from said
processor to said transmitting means corresponds to a
difference address between the address of one of
said reception data areas within said destination
processor element and a head address of said
plurality of reception data areas within said
destination processor element, and said transforming
means includes an adder means for adding said
supplied data identifier and said head address.
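
The adder-based transformation recited above amounts to adding an offset to a head address. The following minimal sketch illustrates this; the function name `to_reception_address`, the head address `0x4000` and the 4-byte word size are hypothetical values chosen only for the example.

```python
def to_reception_address(head_address, identifier):
    """Adder: head address of the reception areas plus the offset identifier."""
    return head_address + identifier


HEAD = 0x4000   # hypothetical head address of the reception data areas
WORD = 4        # hypothetical word size in bytes

# Identifiers 0, 4 and 8 are offsets of three consecutive reception areas.
addresses = [to_reception_address(HEAD, i * WORD) for i in range(3)]
print([hex(a) for a in addresses])
```

Because the program only handles offsets, the reception areas can be placed anywhere in the destination memory by changing the head address alone.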

10. A parallel computer (Fig. 13) as claimed in
claim 8, wherein the data identifier designated by
said data reception demand instruction is different
from the address of one of said reception data areas
accessed by said access means, and said processor
includes a transform means (110) for generating
an address of one of said reception data areas
accessed by said access means from said data
identifier designated by said data reception demand
instruction.

11. A parallel computer (Fig. 3) as claimed in
claim 10, wherein said data identifier designated by
said data reception demand instruction corresponds
to a difference between the address of one
of said data areas accessed by said access means
and a head address of a plurality of received data
areas containing one of said received data areas.

12. A parallel computer (Fig. 13) as claimed in
claim 11, wherein said plurality of received data
areas employed in the respective processor elements
correspond to a portion of said local memory
area employed in said processor element.

13. A parallel computer (Fig. 8) as claimed in
claim 1, wherein said data identifier sent from said
transmitting means is different from an address of
one of said received data areas within the destination
processor element, which should store the
data sent from said transmitting means.

14. A parallel computer (Fig. 8) as claimed in
claim 13, wherein said data identifier sent from said
transmitting means is a difference between the
address of one of said received data areas within
said destination processor element and a head address
of said plurality of received data areas within said
destination processor element, and said receive
means includes an adder means (102, Fig. 10)
for adding the head address of said plurality of
received data areas to said identifier attached to
the received data.

15. A parallel computer (Fig. 14) as claimed in
claim 13, wherein said data identifier sent from said
transmitting means is a number applied to one of
said received data areas within the received data
areas provided in the destination processor element,
and said receive means includes a means
(104, Fig. 14) for producing an address of one of
said received data areas from a received data area
number representative of said data identifier
attached to the received data.

16. A parallel computer (Fig. 8) as claimed in
claim 13, wherein said data identifier designated by
said data reception demand instruction is different
from the address of one of said received data areas
accessed by said access means, and said processor
includes a transform means (110) for generating
the address of one of said received data areas
accessed by said access means in response to
said data identifier designated by said data reception
demand instruction.

17. A parallel computer (Fig. 8) as claimed in
claim 14, wherein said data identifier designated by
said data reception demand instruction corresponds
to an address difference between the address
of one of said received data areas accessed
by said access means and the head address of a
plurality of received data areas containing one of
said received data areas, and said processor
includes a means (114, Fig. 11) for generating the
address of one of said received data areas accessed
by said access means by adding said head
address to the address difference designated by
said data identifier.

18. A parallel computer (Fig. 15) as claimed in
claim 15, wherein said data identifier designated by
said data reception demand instruction corresponds
to a number attached to one of said received
data areas accessible by said access
means, and said receive means includes a means
(114, Fig. 15) for generating the corresponding
address of one of said received data areas from a
received data area number indicative of said data
identifier attached to the received data.

19. A parallel computer (Fig. 18) as claimed in
claim 1, wherein said processor includes:
a means (1303, Fig. 17; 1307, Fig. 18) for
supplying each element of the vector data to be
transferred to said transmitting means in response to the
vector data transfer demand instruction, and
a means (1306) for generating a data identifier
determined by each element of the vector data to
be sent in response to the vector data transfer
demand instruction.

20. A parallel computer (Fig. 18) as claimed in
claim 1, wherein said processor includes:
a means for successively generating a data identifier
determined based upon each element of the
vector data to be received in response to the
vector data reception demand instruction, and
a means for sequentially reading the received data
determined by the respective generated data identifiers
and the corresponding tag area.

21. A parallel computer (Figs. 24, 28, 29, 32)
as claimed in claim 1, wherein the data identifier
sent by said transmitting means is constructed of a
main identifier commonly provided to a plurality of
data, and a sub-identifier specific to each data.

22. A parallel computer constructed of a local
memory, a plurality of processor elements independently
operable, and a network for connecting
said plurality of processor elements, comprising:
a validating means for providing a tag with a portion
or all of words of said local memory positioned
in said plurality of processor elements, said tag
attached to said word representing whether data
held by said word is valid or invalid, and for setting
that a content of said tag attached to said word is
valid when the data is written from an arbitrary
processor element to the word; and,
an access means for checking the content of said
tag attached to said word, and for repeating a
check of said word until said tag indicates validity.

23. A parallel computer as claimed in claim 22,
further comprising:
a logic address designated when the data is written
into a local memory in the other processor element;
a writing address transforming circuit for transforming
the logic address into a real address when the
data is written into the local memory of the other
processor element, said real address attached to
the local memory being different; and,
a reading address transforming circuit for transforming
said logic address into the real address
when reading the data which has been written by
utilizing said writing address transforming means.

24. A parallel computer as claimed in claim 22,
further comprising:
an address generating means for sequentially generating
the address designated when the data is
written into the local memory of the other
processor element and when the data are
read out.

25. A parallel computer wherein there are provided
a local memory, a plurality of processor
elements independently operable, and a network
for connecting said plurality of processor elements,
comprising:
a receive memory capable of being written from an
arbitrary processor element of said plurality of
processor elements via said network;
a validating means for providing a tag with a portion
or all of words constituting said receive memory,
said tag attached to said word representing
whether said word is valid or invalid, and for setting
that a content of said tag attached to said word is
valid when the data is written from an arbitrary
processor element to the word; and,
a tag access means for checking a content of the
tag attached to said word, and for repeating a
check of said tag until said tag represents validity.

26. A parallel computer comprising:
(a) a plurality of processors, and
(b) a network for transferring data between
said plurality of processors,
(c) each of said processors including:
(c1) a local memory for holding a program or data
in which a tag is attached thereto in a word unit;
(c2) a first means for transmitting to said network a
message containing an address of a local memory
in a destination processor, said address being generated
from both a main identifier for discriminating
data to be sent to the other processor from a data
group to which said data belongs;
(c3) a second means for writing a plurality of messages
supplied from said network to said processor
into the local memory based upon the address
contained in said message, and for simultaneously
validating the corresponding tag; and,
(c4) a third means for generating an address of a
received data group in response to an instruction
for demanding readout of the received data from
said local memory based upon a retrieval main
identifier designated from said instruction, for reading
the desired received data from said local
memory, and further for generating a sub-identifier
corresponding to said received data.

27. A parallel computer comprising:
(a) a plurality of processors, and
(b) a network for transferring data between
said plurality of processors,
(c) each of said processors including:
(c1) a local memory for holding a program or data
in which a tag is attached thereto in a word unit;
(c2) a first means for transmitting to said network a
message containing an address of a local memory
in a destination processor, said address being generated
from both a main identifier for discriminating
data to be sent to the other processor from a data
group to which said data belongs;
(c3) a second means for generating an address of
the local memory from said main identifier and
sub-identifier contained in a plurality of messages
supplied from said network to said processor, for
fetching the received data into said address, and
for simultaneously validating the tag corresponding
to said address; and,
(c4) a third means for generating the address of
the received data group in response to an instruction
for demanding readout of the received data
from said local memory based upon a retrieval
main identifier designated from said instruction,
for reading the desired received data from
said local memory, and further for generating a
sub-identifier corresponding to said received data.

28. A parallel processor as claimed in claim 26
or 27, further comprising:
a means for generating an address on said local
memory by utilizing said main identifier and
sub-identifier, and also information representative of an
effective length of said main identifier.
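
One way to picture an address generated from a main identifier, a sub-identifier and the effective length of the main identifier is to pack the two identifiers into a fixed-width field whose boundary is movable. The sketch below is only an assumption-laden illustration: the 16-bit width `ID_BITS`, the function `make_address`, the `base` and `word` parameters and the bit layout itself are hypothetical and are not taken from the claims.

```python
ID_BITS = 16   # hypothetical total width of the combined identifier


def make_address(mk, sk, mk_len, base=0, word=4):
    """Pack MK (upper mk_len bits) and SK (remaining bits) into an address."""
    if not 0 < mk_len < ID_BITS:
        raise ValueError("main identifier length out of range")
    sk_len = ID_BITS - mk_len
    if not 0 <= mk < (1 << mk_len) or not 0 <= sk < (1 << sk_len):
        raise ValueError("identifier does not fit its field")
    combined = (mk << sk_len) | sk      # movable boundary between MK and SK
    return base + combined * word       # word-aligned local memory address


# Few large groups: 3-bit main identifier, 13-bit sub-identifier.
print(make_address(0b101, 3, mk_len=3))
# Many small groups: 12-bit main identifier, 4-bit sub-identifier.
print(make_address(5, 3, mk_len=12))
```

Moving the boundary trades the number of distinguishable data groups against the number of elements per group without widening the identifier.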

29. A parallel processor as claimed in claim 26,
wherein said local memory is constructed of a
region having no tag for holding the program or
data, and a region holding the received data group
and having the tag in a unit of word.

30. A parallel computer comprising:
(a) a plurality of processors, and
(b) a network for transferring data between
said plurality of processors,
(c) each of said processors including:
(c1) a local memory for holding a program or data
in which a tag is attached thereto in a word unit;
(c2) a second means for sequentially reading the
instruction from said local memory;
(c3) a third means, operated independently of
said second means, for performing data transmission
based upon a data group designated by said
second means, a main identifier indicating said
data group, and a number of the transmitted data;
and,
(c4) a fourth means, operated independently of
said second means, for performing data reception
based upon the data group designated by said
second means, a main identifier indicative of said
data group, and a number of the received data.